CIEQSFTTLFACQTAAEIWRAFGYTVKIMVDNGNCRLHVC: these forty letters are a set of instructions for building a sophisticated medical device designed to recognize the flu virus in your body. The device latches onto the virus and deactivates the part of it that breaks into your cells. It is impossibly tiny—smaller than the virus on which it operates—and it can be manufactured, in tremendous quantities, by your own cells. It’s a protein.
Proteins—molecular machines capable of building, transforming, and interacting with other molecules—do most of the work of life. Antibodies, which defend our cells against invaders, are proteins. So are hormones, which deliver messages within us; enzymes, which carry out the chemical reactions we need to generate energy; and the myosin in our muscles, which contract when we move. A protein is a large molecule built from smaller molecules called amino acids. Our bodies use twenty amino acids to create proteins; our cells chain them together, following instructions in our DNA. (Each letter in a protein’s formula represents an amino acid: the first two in the flu-targeting protein above are cysteine and isoleucine.) After they’re assembled, these long chains crumple up into what often look like random globs. But the seeming chaos in their collapse is actually highly choreographed. Identical strings of amino acids almost always “fold” into identical three-dimensional shapes. This reliability allows each cell to create, on demand, its own suite of purpose-built biological tools. “Proteins are the most sophisticated molecules in the known universe,” Neil King, a biochemist at the University of Washington’s Institute for Protein Design (I.P.D.), told me. In their efficiency, refinement, and subtlety, they surpass pretty much anything that human beings can build.
Today, biochemists engineer proteins to fight infections, produce biofuels, and improve food stability. Usually, they tweak formulas that nature has already discovered, often by evolving new versions of naturally occurring proteins in their labs. But “de novo” protein design—design from scratch—has been “the holy grail of protein science for many decades,” Sarel Fleishman, a biochemist at the Weizmann Institute of Science, in Israel, told me. Designer proteins could help us cure diseases; build new kinds of materials and electronics; clean up the environment; create and transform life itself. In 2018, Frances Arnold, a chemical engineer at the California Institute of Technology, shared the Nobel Prize in Chemistry for her work on protein design. In April, when the coronavirus pandemic was peaking on the coasts, we spoke over video chat. Arnold, framed by palm trees, sat outside her home, in sunny Southern California. I asked how she thought about the potential of protein design. “Well, I think you just have to look at the world behind me, right?” she said. “Nature, for billions of years, has figured out how to extract resources from the environment—sunlight, carbon dioxide—and convert those into remarkable, living, functioning machines. That’s what we want to do—and do it sustainably, right? Do it in a way that life can go on.”
If there’s one scientist who seems closest to finding that grail, it’s David Baker, a fifty-seven-year-old biochemist with a boyish face and unruly, gray-tinged hair. For three and a half decades, he has used a combination of biochemistry and computer science to develop techniques and software that now define the field of protein design. Baker was born in Seattle and grew up with a literary bent. His parents, both scientists, recalled a Christmas vacation during which Baker and his two sisters had “heated discussions” about Dostoyevsky. As an undergrad, at Harvard, he focussed on social studies and philosophy before switching his focus to biology. In one class, he learned about protein folding and proposed writing a paper about it; he demurred after his professor told him that the subject was too hard.
After Harvard, he biked across the country to the University of California, Berkeley, where he became a graduate student in the lab of Randy Schekman, a cell biologist and future Nobel laureate who was trying to define the steps in protein secretion. Schekman’s lab had been working on the problem for ten years. Within two weeks, Baker developed a method that allowed them to understand a major part of the process. “It changed the style and the substance of our work for the next twenty years,” Schekman said. In 1993, after a postdoc at the University of California, San Francisco, Baker moved back to Seattle, taking a job at the University of Washington. In 2012, he founded the I.P.D. On September 10th, he won a three-million-dollar Breakthrough Prize—an award endowed by Yuri Milner, Mark Zuckerberg, Anne Wojcicki, Jack Ma, and other tech luminaries—for his work with proteins.
During the pandemic, Baker and other researchers at I.P.D. have turned their attention to proteins that might help in the fight against covid-19. Like many people, they have found their routines transformed. “I do thirty-minute calls with my students, from early morning to late at night, and if I stay at home on the computer doing it I go absolutely insane,” Baker said. Instead—until wildfires covered much of the West Coast in smoke—he talked while walking through the park. Recently, while walking, he reflected on the results of their coronavirus efforts, which have been promising enough that the lab has begun publishing its innovations. “The fact that we were able to come up with things that look like they could be very effective diagnostics and therapeutics and vaccines in such a short time, completely from scratch—it was a bit of a moment where I thought, This stuff could really be useful in the intermediate term,” he said. “If you could make effective therapeutics almost immediately after a new threat emerged—that’s what I’m really excited about now.”
Protein design is hard for lots of reasons. Evolution has had billions of years to explore, by trial and error, the combinatorial possibilities of amino acids. We don’t have the time or resources to throw that much spaghetti at the wall. Imagine that you’re trying to design a protein by trial and error. Some proteins have ten or twenty amino acids; others have thousands. Say that yours is a hundred amino acids long—that means that you have twenty choices for the first amino acid, twenty for the second, twenty for the third, and so on. That’s twenty-to-the-hundredth-power possible combinations—a number so large that it eclipses the quantity of atoms in the visible universe. To design a protein, therefore, it helps to have some sense of the parts out of which proteins are usually constructed—the molecular equivalent of wires, motors, hinges, and bolts.
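The arithmetic behind that comparison is easy to check. The short Python sketch below counts the possible chains of a hundred amino acids; the figure used for the atoms in the visible universe, ten to the eightieth power, is a commonly quoted rough estimate and is an assumption here, not a number from the researchers.

```python
# The twenty standard amino acids, written as one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Assumption: a rough, commonly quoted order-of-magnitude estimate.
ATOMS_IN_VISIBLE_UNIVERSE = 10 ** 80


def sequence_count(length: int, alphabet_size: int = len(AMINO_ACIDS)) -> int:
    """Number of distinct chains: one independent choice per position."""
    return alphabet_size ** length


count = sequence_count(100)
print(len(str(count)))                    # 131 decimal digits
print(count > ATOMS_IN_VISIBLE_UNIVERSE)  # True
```

Python integers have arbitrary precision, so the count is exact rather than a floating-point approximation.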
We also need to understand how the parts are assembled. A protein’s components aren’t manufactured separately and then snapped together. Instead, they emerge as a protein chain folds up, more or less instantaneously, into a complex shape. A number of forces shape how proteins fold. In a phenomenon known as hydrophobicity, some amino acids eschew water; they end up buried in the interior, with the rest of the protein folded around them. “Polar” atoms attract and repel one another, like magnets. Hydrogen atoms bond tightly to other elements. Like a golf ball rolling downhill, a protein seeks the lowest possible state of “free” energy. In its resting position, its chain might double back on itself many times, perhaps forming sheets and coils.
Protein researchers speak of the “folding problem”—the challenge of predicting ahead of time what shape a chain will take. Nature solves the folding problem easily, using the ultimate parallel-processing computer: the universe. In the real world, every particle interacts with every other particle simultaneously. But human-built computers, which make most calculations sequentially, struggle to simulate this process. Given a simulated protein—rendered onscreen as a rainbow-colored wad of ribbon, or as a bunch of grapes—a piece of software might attempt to calculate how different folds will affect the protein’s free energy. The idea is to fold the protein in a consistently downhill direction. But finding the steepest path on such complex terrain is tricky. Sometimes it’s not even clear which way is down. A computer might bring the folding to a stop when, in fact, there is further to go—as though the simulated golf ball has become trapped in a divot from which a real one might easily escape. The software must sometimes cheat a little: picking up the ball and moving it, to see if it wants to get rolling again.
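The golf-ball picture can be made concrete with a toy example. The Python sketch below is not how real folding software works: the one-dimensional `energy` function, the step sizes, and the names `roll_downhill` and `kick_and_reroll` are all invented for illustration. It shows only the general idea of a greedy descent that gets stuck in a divot, and of picking the ball up and dropping it nearby to see whether it rolls somewhere lower.

```python
import math
import random


def energy(x: float) -> float:
    """Toy 'free energy': a broad bowl with ripples that create divots."""
    return 0.1 * x * x + math.sin(3.0 * x)


def roll_downhill(x: float, step: float = 0.01, max_steps: int = 10_000) -> float:
    """Greedy descent: keep stepping while a neighboring point is lower."""
    for _ in range(max_steps):
        best = min((x - step, x, x + step), key=energy)
        if best == x:  # neither neighbor is lower, so the ball has stopped
            break
        x = best
    return x


def kick_and_reroll(x: float, kicks: int = 200, kick_size: float = 2.0,
                    seed: int = 0) -> float:
    """Pick the ball up, drop it nearby, let it roll, and keep the lowest result."""
    rng = random.Random(seed)
    best = roll_downhill(x)
    for _ in range(kicks):
        candidate = roll_downhill(best + rng.uniform(-kick_size, kick_size))
        if energy(candidate) < energy(best):
            best = candidate
    return best


stuck = roll_downhill(4.0)
rescued = kick_and_reroll(4.0)
print(f"greedy only: x={stuck:.2f}, energy={energy(stuck):.2f}")
print(f"with kicks:  x={rescued:.2f}, energy={energy(rescued):.2f}")
```

Starting from the same point, the greedy roll settles in a shallow divot, while the version with kicks finds a much lower resting place. A real protein has thousands of interacting coordinates rather than one, which is what makes the search so hard.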
The most sophisticated program for modelling protein folding is called Rosetta. Baker and his graduate students started writing it in 1996; it looks like a video game crossed with a programming environment, with images of proteins filling some windows and complicated code scrolling in others. Rosetta is open source, and runs on a variety of platforms. It’s now used by hundreds of academic labs and companies around the world, all of whom contribute to the code, which is millions of lines long. Baker, who is not a top-shelf coder, doubts that any of his own code remains: in the early days, comments left next to his contributions would identify them as “crazy Baker stuff.” Still, Sarel Fleishman said, “David’s lab and David himself have been incredibly dominant in this field. Dominant not in the sense of fending people off—it’s actually the reverse. It’s about openness.”
Protein folding has obvious commercial applications, but Rosetta is mostly free. “One of the good choices early on was that no individual would ever make any money directly from it,” Baker told me. The funds generated from corporate licenses go into a pot guarded by a nonprofit called RosettaCommons; some of the money pays for RosettaCon, an annual summer gathering of protein folders traditionally held in August, in Leavenworth, Washington, a mountain town about two hours away from I.P.D. This year, the pandemic upended tradition, and the meeting was held virtually. Meanwhile, in April, a couple hundred researchers convened an early, online meeting, to discuss covid-19. “A lot of us have been talking about the idea of feeling called to work on covid during this time,” Rebecca Alford, who completed her Ph.D. at Johns Hopkins, in June, told me. The fact that so many protein designers use Rosetta has made impromptu collaboration easy. Alford said, “You can ask someone in California or in China, ‘What do I do with this piece of code?’ ”
Protein-folding software has two main components: a “sampling method” and an “energy function.” The sampler tries different starting places for the golf ball; the energy function aims to direct it downhill. From the beginning, Rosetta, drawing on Baker’s lab experiments, was good at both tasks. It successfully predicted protein folds. But it achieved its singular position in the field because of tweaks and additions made, over the years, by the larger community of researchers, which honed the software’s precision and extended its capabilities. “Every new generation of students is motivated to contribute,” Baker said. “They share in the progress and benefits—including a very luxurious, all-expenses meeting and reunion once a year.”
In the nineteen-seventies, the pioneers of protein design worked by building physical models of their amino-acid chains. William DeGrado, a biochemist at the University of California, San Francisco, coined the term “de novo” protein design in the nineteen-eighties; he recalled, “I was told it was going to be impossible quite a bit.” Protein design is a two-way street: you must figure out how to predict a shape from a sequence and also find the right sequence for a desired shape. It’s a give-and-take, with the overarching goal of finding a shape that does something useful, such as binding, antibody-like, to a virus. A protein designer might start by taking natural proteins and tweaking them. She might also use a system of directed evolution, in which large collections of proteins are tested, selected for certain properties, and then mutated, over and over, until the right traits emerge. (Refining this process is what won Arnold her Nobel Prize.)
Thanks to improved computational tools, including Rosetta, and faster methods for making and testing proteins, de-novo design has begun to show real promise. “It’s amazing how much progress has been made, and how it’s just accelerating so rapidly,” DeGrado said. Baker agreed that progress was speeding up. “The fact that we’re spinning out a couple of companies a year is kind of remarkable,” he said. His lab’s work on covid-19 has convinced him that the grail is almost within reach. “The hope is that the next time there’s an outbreak, within two days, we’ll have models of candidates,” he told me.
Broadly speaking, new advances in protein design have clustered in three main areas. The first is “binding”—the construction of proteins that adhere tightly to biological targets. In May, I spent a Friday night video-chatting with Inna Goreshnik, a research scientist at I.P.D., as she carried out part of an experiment with Longxing Cao, a postdoc. (I.P.D. occupies the top two floors of its building, and is home to around a hundred and thirty scientists, seventy of whom work in Baker’s lab.) Goreshnik stood at a lab bench in a striped sweater and face mask. “This is very stressful,” she said, as she carried out the calculations needed to prepare the samples. “I usually don’t have anyone watching me do math.”
Their target was sars-CoV-2, the coronavirus that causes covid-19. Earlier, Cao had identified a vulnerable spot on the virus’s spike protein—a kind of grappling hook on its outer shell which enables it to invade cells. His goal was to design “binder” proteins that would adhere to that particular spot on the spike, thereby disabling its function. Rosetta contained a precise model of the spike; Cao had written scripts that used that model to generate, de novo, binders that might work. It was as though, given the measurements of a hand, Rosetta were designing a glove. The program ended up suggesting nearly a hundred thousand possible binders, most between fifty-five and eighty-eight amino acids long. For a few thousand dollars, Cao hired a biotech company to produce DNA strands—synthetic genes—that could instruct cells to build those binders. He then introduced each synthetic gene, encoding a unique binder, into a different yeast cell, and, once those cells had manufactured the binders, added the viral spikes. To see if the binders had attached to the spikes, he ran the cells past a laser, one by one, looking for subtle signatures in their fluorescence. A few of the binders did pretty well.
This was the process’s first step. In the second, Cao subjected the most promising candidates to “site-saturation mutagenesis”—a directed-evolution technique. He swapped out the first amino acid of each candidate for a different one, creating nineteen alternate versions. He repeated this process for the second amino acid, then the third, and so on. Then he ordered another batch of DNA that could make these mutated proteins, and tested them. Certain single-site mutations worked better than others; he created a third set of proteins, combining the best ones. These proteins were what he and Goreshnik were about to produce. During our video chat, Goreshnik held up two small tubes containing white powder: the dried DNA strands. Cao raised a flask of yeast cells, into which the DNA would go.
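The bookkeeping of that mutation scan can be sketched in a few lines of Python. This is an illustration of the enumeration only, not Cao's actual scripts; the function name `single_site_variants` is made up, and the example sequence is the flu-binding protein quoted at the start of this article rather than one of the coronavirus binders.

```python
# The twenty standard amino acids, written as one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def single_site_variants(sequence: str) -> list[str]:
    """Every sequence that differs from the input at exactly one position."""
    variants = []
    for position, original in enumerate(sequence):
        for replacement in AMINO_ACIDS:
            if replacement != original:  # nineteen alternatives per position
                variants.append(
                    sequence[:position] + replacement + sequence[position + 1:]
                )
    return variants


flu_binder = "CIEQSFTTLFACQTAAEIWRAFGYTVKIMVDNGNCRLHVC"
variants = single_site_variants(flu_binder)
print(len(variants))  # 760, which is 40 positions times 19 alternatives
```

The count grows only linearly with the length of the chain, which is why a scan of every single-site change is affordable while a search over every possible sequence is not.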
For around three hours, Goreshnik mixed the DNA fragments with other chemicals, then ran them through a PCR machine, which multiplied and sewed them together. She purified the results, then multiplied and purified them again. “There’s lots of walking and a lot of pipetting,” she said. Eventually, she showed me a small container: “All that work, and at the end we get just thirty microlitres of liquid in a tube,” she said. Later that night, Cao would introduce the DNA to the yeast cells, which together would make the binding proteins over the course of the next twenty-four hours. Goreshnik and Cao hoped that, in addition to making proteins that bound to sars-CoV-2, they could refine their process so that more of it could be done with Rosetta. “The final goal is just to order one design, and it works,” Cao said. Ideally, the de-novo protein wouldn’t just bind to its target strongly and specifically—it would do so in exactly the way predicted by the software.
A similar process was used to create the flu-binding protein described at the beginning of this article; it was first revealed in a paper published in Nature, in 2017. The process was also used to design Neo-2/15, a cancer drug being developed by a company called Neoleukin, which Baker spun off from his lab (and in which he retains an ownership stake). Neo-2/15, the de-novo protein design currently closest to coming to market, is a new version of a signalling molecule called interleukin-2 (IL-2), which is created naturally by the immune system. IL-2 attaches to receptors on white blood cells, supercharging their response. Certain kinds of cancer patients can benefit from high doses of IL-2, but the treatment carries risks: the molecule binds to three different receptors, and one of those, if overstimulated, can unleash a toxic response in the body. Researchers have tried using directed evolution to mutate IL-2 so that it binds only to the nontoxic receptors; it hasn’t worked. Last year, Baker and his collaborators used Rosetta to design a new protein with the desired binding. Their protein is only distantly related to human-produced IL-2, and has successfully treated mice with skin and colon cancer.
Baker has likened de-novo protein design to the jump from the Stone Age to the Iron Age: instead of carving tools out of whatever we find in nature, we’ll be able to cast our inventions in whatever shape we wish. I asked him how close we were to the Iron Age. “The test is going to be later this year,” he said, referring to the clinical trials for Neo-2/15. “Then we’ll really see what de-novo-designed proteins do inside people.” Recently, Science published the results of the study I’d observed over video chat. Two of the sars-CoV-2 antiviral proteins that the group had designed were several times more potent than the best monoclonal antibodies currently in development.
The second main area of progress in protein design has to do with self-assembly—the creation of small proteins that join together to make something larger. Here, too, I.P.D. has made a contribution. In a paper published in Science, in 2016, Baker’s lab reported the development of a protein-based icosahedron—a twenty-sided geometric shape, like a die for Dungeons & Dragons. The icosahedron was built from twenty “trimers” and twelve “pentamers”—proteins made of three and five smaller proteins, respectively. The component proteins had been built by bacteria, according to DNA instructions; they were then dissolved in a solution and, while floating around, joined together of their own accord, to create the symmetrical forms that Rosetta had predicted. A protein with such a shape—which is easy to build and roomy inside, with many useful vertices—could carry medicinal cargo through the body; it could also be studded with bits of virus, and, therefore, become a vaccine. (Immunologists have found that, when antigens form a repeating pattern—as they would on the surface of an icosahedron—they tend to stimulate a stronger immune response.)
Last year, Neil King’s lab at the I.P.D. produced such a vaccine: an icosahedron, or “nanoparticle,” arrayed with proteins from respiratory syncytial virus (R.S.V.), the leading cause of infant mortality after malaria. In animals, the new vaccine was ten times as effective as one in which viral proteins floated freely, on their own. A spinoff company, Icosavax, is now developing the R.S.V. vaccine further, with fifty-one million dollars in Series A financing; King, who was a postdoc in Baker’s lab and now leads the I.P.D.’s vaccine efforts, advises Icosavax. (Both he and Baker retain ownership stakes.) He is also working with the National Institutes of Health to use the same technology for a universal flu vaccine and a vaccine for sars-CoV-2. Last month, on the Web site bioRxiv, he posted a “preprint”—a paper that has not yet been peer-reviewed—on the first sars-CoV-2 results. The lab had vaccinated mice with a self-assembling protein nanoparticle on which sixty copies of the key part of the coronavirus’s spike protein had been embedded; in response, the mice produced ten times as many antibodies as they’d made when given a vaccine containing spike proteins alone. The antibodies made in response to the nanoparticle were also more powerful: they targeted multiple spots on the spike.
Vaccines aren’t the only molecular tools that can be self-assembled. In another project, the results of which were published last year, in Nature, Baker’s lab designed proteins that align with ions on the surface of mica to form a honeycomb pattern. Scientists think that such a latticework could act as a water or air filter. But the process—in which a mineral substrate is used to assemble proteins in an ordered way—could also be reversed. “We want to flip it over, and use a protein scaffold to control the assembly of a mineral,” Harley Pyles, a postdoc in Baker’s lab, said. Such a scaffold could allow scientists to turn calcium carbonate, also known as limestone, into an environmentally friendly replacement for cement, or to transform zinc oxide—used often in lotions, food supplements, and plastics—into a material for solar cells. I asked Pyles if all projects required as much trial and error as the medical molecules. “Some problems are more push-button at this point,” he said. Building a protein that binds to a virus is hard; the virus has evolved to be slippery. If you control both sides of the interaction, however—designing proteins that bind to each other—you can move much faster.
The third area of progress has to do with functionality: the creation of proteins with flexible, moving parts. Sarel Fleishman uses Rosetta to design proteins defined not just by their shape but by their functions. They are too large to design from scratch; he builds them by changing designs found in nature. Still, the proteins contain so many mutations—sometimes more than a hundred each—that they no longer resemble anything that biochemists might find in a regular cell; if the proteins were houses, they would have been thoroughly renovated. Recently, Fleishman’s lab redesigned a naturally occurring enzyme that breaks down nerve agents, such as cyclosarin and Russian VX, similar to the one used against the Russian opposition leader Alexey Navalny. The original enzyme, PTE, is too slow to be of much use, so Fleishman’s lab ran it through an algorithm that it developed called pross (Protein Repair One Stop Shop), which figures out how to redesign proteins so that they’re more stable and effective. Most proteins designed through evolution occur in a number of variations; pross analyzes the variations to find the most common amino acids at each of their positions. The algorithm, working on the theory that the more commonly evolved variants have greater stability, then uses Rosetta to arrive at an improved version of the protein. After applying pross to the PTE enzyme, Fleishman’s lab used another of its algorithms, FuncLib, to select the best candidates for testing. In animals, the resulting proteins were thousands of times more efficient than PTE at metabolizing cyclosarin—fast enough to be useful in the real world. (Proteins that have been stabilized become not only more powerful but hardier: Fleishman’s lab has also used pross to improve a malaria vaccine so that it can hold its form without refrigeration.) Fleishman runs pross and FuncLib on academic Web servers that anyone can access; other labs are now using the software. 
“We see papers coming out from labs I’ve never heard of, and working on enzymes I’ve never heard of,” he said. People plug in problems they’ve been attacking for a decade, and the algorithms just work.
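The first step attributed to pross above, finding the most common amino acid at each position across a protein's natural variants, can be sketched in Python. This is not pross itself, which goes on to use Rosetta; the function name `consensus` and the four short sequences are hypothetical, made up purely for illustration, and the sketch assumes that the variants have already been aligned to the same length.

```python
from collections import Counter


def consensus(aligned_variants: list[str]) -> str:
    """Most common letter in each column of a set of equal-length sequences."""
    if len({len(variant) for variant in aligned_variants}) != 1:
        raise ValueError("variants must be aligned to the same length")
    # zip(*...) walks the sequences column by column, one position at a time.
    return "".join(
        Counter(column).most_common(1)[0][0]
        for column in zip(*aligned_variants)
    )


# Hypothetical aligned variants of one short protein segment.
variants = ["MKTAY", "MKSAY", "MRTAF", "MKTGY"]
print(consensus(variants))  # MKTAY
```

When two letters tie at a position, `Counter.most_common` returns whichever was encountered first, so a more careful version would need an explicit tie-breaking rule.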
Scott Boyken, a former postdoc in Baker’s lab, has designed, from scratch, proteins with tiny moving parts. One project, published in Science last year, tackles the problem of “endosomal escape.” When a drug or protein enters a cell, the cell wraps it in a membrane called an endosome; the endosome makes it harder for the drug to penetrate the cell’s inner reaches. “That membrane barrier has evolved for the last three billion years to prevent you from crossing it,” King told me. “It is a formidable challenge. But it’s one that we know proteins can solve.” Some viruses, including sars-CoV-2, cross the endosomal barrier to replicate.
To create his moving proteins, Boyken used a Rosetta module he’d written called HBNet, which allows him to build with hydrogen bonds, chemical connections that are sometimes sensitive to acidity. Because endosomes create an acidic environment, Boyken designed a protein that cracks open only at a particular pH. When the protein opens, it unsheathes molecular coils that have the ability to disrupt the endosomal membrane. A drug attached to such a protein could slip through the bars of the intracellular prison.
In more speculative work, published in Science in April, Boyken and his colleagues designed protein logic gates—equivalents to the “AND,” “OR,” and “NOT” gates at the heart of computer circuitry. They have also created a protein-based equivalent of swinging robotic arms, which they can use to precisely program new biological functions into cells. In work published in Science last month, they used the arms to build logic gates and switches on the surfaces of cells; the system looked for combinations of signs which suggested that the cells were cancerous—molecules A and B, but not C—and, if it found them, summoned an immune response. “This is a whole new field,” Boyken said. “It’s going to revolutionize how we engineer biology.”
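The rule the proteins enforce, A and B but not C, is the same one a programmer would write with Boolean operators. The Python sketch below spells out that logic and its truth table; the function name `should_summon_response` is invented for illustration and implies nothing about how the proteins themselves compute.

```python
from itertools import product


def should_summon_response(has_a: bool, has_b: bool, has_c: bool) -> bool:
    """True only when markers A and B are present and marker C is absent."""
    return has_a and has_b and not has_c


# Print the full truth table: eight combinations of three markers.
for has_a, has_b, has_c in product([False, True], repeat=3):
    print(has_a, has_b, has_c, "->", should_summon_response(has_a, has_b, has_c))
```

Of the eight possible combinations, exactly one triggers a response.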
The field of protein design has been built by a vast community of scientists, each contributing a part. In 2005, Baker expanded the size of the community by releasing a program called Rosetta@home, through which he invited anyone to help solve the protein-folding problem. Download it, and your computer can use its spare C.P.U. cycles to process proteins, displaying its work in progress as a screen saver.
In the year after its release, thousands of people downloaded Rosetta@home. Some grew frustrated: watching their screen savers, they spotted folding solutions that their computers stubbornly missed. Human beings, it turned out, have intuitions about the 3-D world that software lacks. In response, in 2008, Baker added a new feature to Rosetta: a game called Foldit, in which players compete to see who can predict a protein’s fold the best. In one competition, four hundred and sixty-nine Foldit players from around the world predicted a protein structure better than a class of undergrads, a pair of trained crystallographers, and the Rosetta folding algorithm.
Recently, I downloaded Foldit, which presents a rotating set of challenges to players, and started working my way through the introductory puzzles. The program showed me an increasingly complex set of three-dimensional structures, with bundles of ribbon, corkscrews, and wire representing the parts of proteins. The structures were not completely folded; little animations indicated protrusions that feared water and had to be rotated inward, or internal clashes that had to be resolved. By clicking and dragging, I could bend the over-all backbone this way or that, or fiddle with the little sidechains. I could also click buttons to “shake” or “wiggle” the molecule, perhaps jostling the golf ball into a lower energy state. A score indicated how close I was to perfection. I found Foldit frustrating, like chipping a ball around a course without knowing where the hole was. Still, I could see how others might find it engaging, even addictive, like a Rubik’s Cube, or golf.
Last year, Baker’s lab reported that Foldit players were also adept at designing proteins. The program had given them a monotonous chain of one repeating amino acid. They could fold it up, or replace any acid with a different one—essentially, they could throw spaghetti at the wall. There were very few constraints on what they could design; the challenge, of course, is that a vast majority of sequences don’t fold up into a stable structure. In the end, four thousand players designed fifty-six proteins deemed successful. In doing so, they used a more diverse set of exploration strategies than Rosetta uses on its own. “Protein design is this very open-ended problem,” Brian Koepnick, the scientist who runs the Foldit project, said. “The creativity of citizen scientists can do things that we can’t do with normal protein-design programs.” Since the pandemic began, Foldit has asked players to predict the shapes of various coronavirus proteins, and to design new binders for coronavirus targets. Usage has skyrocketed. Some power users have written scripts to automate parts of the design process. I.P.D. is now testing some of their work in the lab.
Other kinds of tools are also being brought to bear on protein design. “My personal belief is that the future belongs not to Rosetta but to machine learning,” Frances Arnold told me, from her back yard. For ten years, she’s been designing proteins through directed evolution and selecting the most promising sets of mutations; now she’s developing A.I. to do the selecting. Baker, too, has been experimenting with neural networks. Rosetta works by painstakingly calculating its way downhill. An alternative, he said, is to train a neural network “on a very large protein-structure database”—to “have the network learn what proteins look like.” Recently, he said, “We’ve been able to generate brand-new protein structures that look pretty compelling using deep-learning generative models. But those are at the very early stages. I think, moving forward, there will be a very interesting synergy between deep learning and methods like Rosetta.”
The road to pandemic cures, and much else, may be paved by some combination of physics-based simulations, generative neural networks, directed evolution, and hobbyists playing Foldit under lockdown. Arnold notes that proteins—unlike airplanes, bridges, or other engineered artifacts—are almost infinitely malleable. “This special feature of proteins makes it a space for engineering where all these tools can come together in a synergistic fashion,” she said. “That’s why I’m so excited about it. And I’m excited about what David Baker does. Because all these tools need to come together. And, when they do, we’re going to explode in our capabilities for designing the biological world.”
“We’re trying to change the way that technology and engineering on the molecular scale are done in biology,” Baker said, over video chat, while walking through his local park. “Currently, the way that works in biological engineering has all been about making small modifications to what we find in nature. Or else you make completely random collections of molecules and select those that look useful. . . . To be able to create new and useful molecules by first-principles design is a breakthrough.” As we spoke, Baker walked. On my screen, I saw the sun break through the leafy canopy overhead.