🔬 The Bitter Lesson is Coming for Proteins - Alex Rives, BioHub — Latent Space

Intro0:00

Alex Rives0:00

So ESM-C is al- is also approaching programmable biology, but I would say in a very different way. It's approaching it from this kind of world modeling perspective, where the idea is basically you have a predictive model and, you know, y- you're gonna search the world model to find protein molecules that satisfy kind of whatever design criteria that you have.

So we've been able to use this to actually now go and design, um, many protein binders.

RJ Haneke0:26

Right. Mm-hmm.

Alex Rives0:26

But I think sort of most excitingly, we've been able to use this to actually design antibodies, scFvs.

RJ Haneke0:34

Hello, welcome to the Latent Space AI for Science podcast. I'm RJ Haneke, CTO of Mira Omics.

Brandon Anderson0:41

Yeah, and, uh, I'm Brandon. Today, it's a pleasure to have Alex Rives, uh, head of science at BioHub. Yeah, would you like to introduce yourself real quick?

Alex Rives0:49

Yeah, yeah. Thank you for having me here. It's great to be here. Um, I'm head of science at BioHub. I'm a computer scientist, uh, and I work on AI for biology, and a lot of my work has been on language models for biology.

Brandon Anderson1:02

By the time this podcast is released, you will have put out several new, exciting, interesting models. Going over them, I couldn't help but have the kind of thought that you might be the most bitter lesson-pilled person in protein biology right now. Can you give a little context about what that means for biology and, you know, why you're so committed to and excited about this route?

Bitter Lesson1:02

Alex Rives1:24

Well, I'll take that. Uh- ... I believe in scaling laws.

Brandon Anderson1:27

Yeah.

Alex Rives1:27

So, you know, I guess I've been working on this since the summer of 2018. Um, and so my team, when we were at Meta FAIR, trained, uh, really the first transformer language model for protein biology. And so I guess, you know, I've always thought that there would be kind of an emergence of biological information as you train a model to predict the next token, you know, that evolution creates.

So our team has really explored that idea over a number of different years, and we've really kind of, I think, seen the scaling curve and really seen as we have, have increased models by an order of magnitude, kind of in each generation that, you know, there's this emergence of new capabilities.

Brandon Anderson2:09

Yeah. So you've been, you say emergence of capabilities, scaling over generations. You've been working at this, as you said, for, I guess it would be eight years now, something like that? It didn't always work that way, right? Like, there was signs that scaling might work, you know. We'll be getting to some new results where it-- I think really you've kind of clearly demonstrated this hypothesis in a way that hasn't happened before.

But you seem to have, like, a strong commitment to this in a way that I'm not necessarily sure I would have been so convinced that it would work in the same way. I mean, proteins are not the-- protein language is not the same thing as natural language. There are similarities. If you start sampling, you know, a normal language transformer at infinite temperature, you're gonna get gibberish.

You sample a protein language model at infinite temperature, you're gonna get something which is a valid protein, if not an interesting protein. Despite the fact that it is a different domain for a different reason, I'm not necessarily sure that I would a priori assume the natural language model insight would transfer over. So what is it specifically about proteins that you thought was special or, you know, that would make this also valid?

Alex Rives3:17

Yeah, I mean, it's a really interesting question, I think, kind of a deep question across AI right now-

Brandon Anderson3:22

Mm-hmm

Alex Rives3:22

... more broadly. And you know, I, I think, you know, what's, what's so interesting is AI right now is, is such an empirical science, and so we don't have, you know, theory that can always guide us in these things, but we have this really strong empirical evidence of scaling. The thing that I was motivated by is, you know, if you think about evolution and, you know, y- you think about the data that we, we have around proteins, we have databases that have billions of protein sequences.

And, you know, those sequences contain patterns. And, you know, it had long been known, so, you know, this is going back decades before we started working on this with language models, that there are patterns in the sequences of protein families that are there because of the constraints that evolution is operating under.

So you can think about, you know, like a, um, a protein sequence that folds into a three-dimensional structure in space, and you can, you know, imagine that there are two residues or amino acids that are in this sequence that might be in contact in that folded structure. And so evolution isn't free to choose those independently from each other.

If it makes a choice at one position, it kind of has to make another choice that's gonna be compatible at the other position. So going back, you know, all the way to the beginning of gene sequencing when people first began to be able to look at this and kind of look at the same protein in related organisms, you could start to see these kind of patterns that are reflecting the fundamental underlying biology.
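The covariation signal described here can be illustrated on a toy alignment. The sketch below is not how ESM works internally; it uses the classical mutual-information statistic between alignment columns, and the four aligned sequences are made up for illustration. Column pairs that change together score high, while independent or constant columns score zero.

```python
from collections import Counter
from math import log2

def column(msa, i):
    """Return the residues found at alignment position i."""
    return [seq[i] for seq in msa]

def mutual_information(msa, i, j):
    """Mutual information (in bits) between alignment columns i and j."""
    n = len(msa)
    count_i = Counter(column(msa, i))
    count_j = Counter(column(msa, j))
    count_ij = Counter(zip(column(msa, i), column(msa, j)))
    mi = 0.0
    for (a, b), c in count_ij.items():
        p_ab = c / n
        mi += p_ab * log2(p_ab / ((count_i[a] / n) * (count_j[b] / n)))
    return mi

# Hypothetical alignment: columns 0 and 1 always change together,
# column 2 is fully conserved, column 3 varies independently.
toy_msa = ["AKLE", "AKLD", "GRLE", "GRLD"]

coupled = mutual_information(toy_msa, 0, 1)      # covarying pair
independent = mutual_information(toy_msa, 0, 3)  # unrelated pair
```

In this toy case the coupled pair carries one full bit of shared information and the independent pair carries none, which is the kind of pattern people could read out of related sequences long before language models.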

So the idea behind ESM, kind of the thinking behind ESM was, okay, what if you were to apply this principle kind of across all of evolution, across kind of the vast diversity of proteins that have been generated across all of life and, you know, basically have a language model kind of predict the amino acids that evolution will choose to place in proteins across all of those biological contexts.

So you can think that there's just this kind of like incredible amount of information in that total picture about the underlying biology of proteins. And so that was really the idea that sparked this: as a model is having to predict the next token, and actually we train these models with masked language modeling, so they're predicting kind of tokens that are masked out of various parts of the sequence, it would have to learn something about those kind of underlying constraints that are shaping which tokens evolution can choose.
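As a rough illustration of the masked language modeling setup mentioned here, the sketch below hides a few residues of a protein sequence and records what the model would be asked to recover. The 15% masking rate, the `<mask>` token string, and the example sequence are assumptions borrowed from common practice, not details given in the conversation.

```python
import random

MASK = "<mask>"  # assumed placeholder token, not taken from the ESM codebase

def mask_sequence(seq, mask_fraction=0.15, seed=0):
    """Hide a random subset of residues; return model input and targets."""
    rng = random.Random(seed)
    n_mask = max(1, round(len(seq) * mask_fraction))
    positions = sorted(rng.sample(range(len(seq)), n_mask))
    tokens = list(seq)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]  # what the model must predict
        tokens[pos] = MASK          # what the model actually sees
    return tokens, targets

# Hypothetical sequence used only for illustration.
tokens, targets = mask_sequence("MKTAYIAKQRQISFVKSHFSRQ")
```

The training signal is then just the model's error in predicting each entry of `targets` from the surrounding unmasked residues, which is why it has to pick up on constraints between positions.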

Model History5:59

Brandon Anderson5:59

Yeah. So maybe for a bit of history, um, so, you know, you, you have

RJ Haneke6:03

You, you just released, um, evolutionary scale modeling Cambrian, right? Is that what it's called? Yeah. And this is like the maybe fourth or fifth in a series of models, I think maybe even more if you go back before they were called ESM.

Alex Rives6:15

No, they're, they, they were called ESM from the start.

RJ Haneke6:17

Okay, they were called ESM from the start.

Alex Rives6:18

Yeah. We had sort of various branches-

RJ Haneke6:19

Okay. Yeah, yeah

Alex Rives6:20

... of, of the different models.

RJ Haneke6:21

Yeah, yeah.

Alex Rives6:21

Yeah. So, so this one I would say is, is kind of a, a fourth generation model. Um, it's actually a model that we trained a little over a year ago. Now that we're at BioHub, we're, um, we're, we're open sourcing this, this model fully under MIT license for the first time. So we're really excited to do that.

But kind of the big thing that is new here is that we've really kind of built a world model of protein biology. So the foundation of that is ESMC, but, you know, using the representations of ESMC, we've kind of now built a structure prediction model, um, and this is the next generation ESMFold model. And then we've also used the techniques of mechanistic interpretability and sparse coding to really start to look deeply into the representation space of the language model and kind of be able to pull out the underlying features that the model actually uses to represent protein biology.

So bringing all of this together, we're able to, you know, really make predictions for protein structure, um, predictions about kind of the underlying features that, that proteins are made out of that allows us to build linkages across evolution. We're able to take this model and invert it to design proteins, and we've, we've, we've used this to kind of create a comprehensive picture of protein biology.

So we, we put together kind of all the world's largest protein sequence databases, and so that kind of amounts to 6.8 billion non-redundant proteins, and then we've, we've resolved predicted structures for 1.1 billion of those. And, and we've also computed features across all of those so that we can make these linkages basically all across, um, evolution and protein biology.

RJ Haneke8:06

Right. 6.8 billion of which you've r- resolved structure for 1.2, is that right?

Alex Rives8:12

1.1.

RJ Haneke8:13

1.1.

Alex Rives8:13

Yeah.

RJ Haneke8:14

So what about the others?

Alex Rives8:16

Well, so, so basically what we did is we took that database and we clustered it at 70% sequence identity. So it's, it's really resolving structures for everything in the sense that for each cluster, we kind of have a cluster center, we're predicting the structure there, and then we can expect that the other proteins are gonna have a similar template structure.
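To make the idea of clustering at 70% sequence identity concrete, here is a minimal greedy sketch. It is a simplification under stated assumptions: identity is computed position by position on equal-length sequences with no alignment step, the sequences are invented, and the conversation does not say which clustering tool was actually used.

```python
def identity(a, b):
    """Fraction of matching positions; assumes equal-length, pre-aligned input."""
    if not a or len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)

def greedy_cluster(seqs, threshold=0.7):
    """Assign each sequence to the first cluster center it resembles."""
    clusters = {}  # cluster center -> list of member sequences
    for s in seqs:
        for center in clusters:
            if identity(s, center) >= threshold:
                clusters[center].append(s)
                break
        else:
            clusters[s] = [s]  # no close center found: s starts a new cluster
    return clusters

# Hypothetical sequences: three near-identical variants and one outlier.
seqs = ["MKTAYIAKQR", "MKTAYIAKQK", "MKTAYLAKQR", "GSHMLEDPVA"]
clusters = greedy_cluster(seqs)
```

Under this scheme a structure only needs to be predicted once per cluster center, and the other members are expected to share that fold with small variations.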

RJ Haneke8:35

I see.

Alex Rives8:35

There'll be, be small variations-

RJ Haneke8:36

So they, they're-

Alex Rives8:37

... but they have the same fold

RJ Haneke8:37

... 1.2 billion or so clusters.

Alex Rives8:41

That are, that are kind of covering the 6.8-

RJ Haneke8:43

Yeah

Alex Rives8:43

... billion. Yeah.

RJ Haneke8:44

Okay. Interesting. And you know, maybe w- since we're talking about scaling, how do you know that, um, this is the right number, right? Uh, like, uh, how do you know that focusing on these 1.1 billion, and that's the right resolution for this model?

Alex Rives9:01

Well, we've chosen them so that they really cover that entire space.

RJ Haneke9:04

Mm-hmm.

Alex Rives9:04

So I think what I can say about this database is it's really the most comprehensive picture of protein structure and function that's been created. It's adding, you know, hundreds of millions of structures to our knowledge of, of kind of protein, the diversity of protein structure, and it's also creating this, uh, feature space that allows us to find these linkages between proteins across evolution.

So we can see kind of really interesting themes emerging across evolution, you know, linking, for example, um, gene editing s- systems which are very far apart in sequence but, you know, they share some kind of underlying functional, um, patterns, structural homology that the model's able to bring together and, and find those connections.

RJ Haneke9:47

Now we're talking about the mechanistic interpretability part. So you have, if I understand correctly, you use sparse autoencoders and other techniques maybe to understand, okay, what are the-- when I activate the network using a protein, then what are the patterns of outputs that I'm seeing, and how do they relate to each other? If I understand correctly, is that you have these sequences that are unrelated or only partly related based on the actual sequence, but in terms of behavior, they have similar behavior and therefore they are activating similar networks.

Is that kind of the summary of what you just said?

Alex Rives10:22

Yeah. So I mean, basically what we've done is we've trained sparse autoencoders across all the different layers of the ESMC model family. So there's actually three models in that family. There's a 300 million parameter model, a 600 million parameter model, and a six billion parameter model. And then we've looked really, we've done kind of a very deep analysis of the feature space of that six billion parameter model, which is really the state-of-the-art protein language model.

And so w- what we find, what's really interesting is there's, there's kind of this, um, you know, the, this hierarchy of features that emerges. What's really interesting about it is it's, it really kind of corresponds to the reductive picture of biology that has been developed over, you know, ma- many decades, a century of, of, um, biological experiments.

But what's so cool is this is emerging, you know, without any prior knowledge. It's been learned by the language model. So the interesting thing about SAEs, right, is they're really just revealing the intrinsic structure of the representation space. So this model's been trained on protein sequences, it's been trained just to predict the amino acids that evolution will choose and then, you know, somehow this is leading to the emergence of this like kind of very ordered feature space that has this hierarchical structure where you can really see everything from the basic biochemical properties and kind of the basic structural building blocks of proteins to these very large kind of functional themes, these kind of abstract concepts that, you know, connect to kind of the human picture of protein function.
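For readers unfamiliar with sparse autoencoders, the sketch below shows the basic objective on a single activation vector: encode into a wider, non-negative feature vector, decode back, and penalize both reconstruction error and feature activity. The weights are set by hand purely for illustration (a real SAE learns them by gradient descent), and the L1 penalty and its coefficient are common choices assumed here, not details stated in the conversation.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def sae_forward(x, w_enc, b_enc, w_dec, l1_coeff=0.01):
    """Return sparse features, the reconstruction, and the training loss."""
    pre = [a + b for a, b in zip(matvec(w_enc, x), b_enc)]
    features = relu(pre)             # non-negative, mostly zero
    recon = matvec(w_dec, features)  # map features back to activations
    mse = sum((r - t) ** 2 for r, t in zip(recon, x)) / len(x)
    sparsity = sum(abs(f) for f in features)
    return features, recon, mse + l1_coeff * sparsity

# Hand-set toy weights: 2-dimensional activation, 4 candidate features.
w_enc = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
b_enc = [0.0, 0.0, 0.0, 0.0]
w_dec = [[1.0, -1.0, 0.0, 0.0], [0.0, 0.0, 1.0, -1.0]]

features, recon, loss = sae_forward([1.0, -1.0], w_enc, b_enc, w_dec)
```

Here only two of the four features fire and the input is reconstructed exactly, so the remaining loss is entirely the sparsity penalty. Interpreting a trained SAE amounts to asking which inputs make each such feature fire.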

RJ Haneke11:58

And do you have a hypothesis or a feel for why-

Brandon Anderson12:03

If there are relationships between the sequences themselves, even if they're, like, shifted and, and cut up and recombined in different ways, like I can imagine that might work because you have these, you know, these proteins are kind of hierarchical in their nature as well. So maybe the hierarchy moves around, but they're the same sequence. But the functional units, I guess, th-those have related structures.

What is the hypothesis here?

Alex Rives12:29

I mean, it's, it's a really interesting question, right? And I think, I think I can speculate about it.

Brandon Anderson12:33

Sure.

Alex Rives12:33

I don't think we, we kind of completely understand this, right? But let, let me give a concrete example. So you know, the nucleophilic elbow is this kind of like core kind of functional motif that, uh, people have thought that, you know, maybe this actually, um, has emerged independently-

Brandon Anderson12:48

Mm-hmm

Alex Rives12:48

... in evolution, you know, different times and different protein families. But it has this, you know, very, very clear, um, structural motif that you can kind of see in a, in a crystal structure. You know, what we found basically is that the model has a kind of a single feature for this nucleophilic elbow, and it's activating across these like very evolutionarily diverse families.

You know, really completely different structural topologies, proteins that probably evolve like entirely independently from each other. But, you know, the model is kind of using this one feature to represent that. So why does it do that? I mean, I, I think it's a really interesting question. I mean, I think one answer is sort of the idea of, of compression and the idea that, you know, the model needs to have some kind of underlying latent variables that it develops to help solve this, this kind of sequence prediction task.

Because the nucleophilic elbow, you know, it's gonna be a function of... What's so, what's so interesting, right, is the choice of any amino acid is kind of like completely entangled with, with the choice of all the other amino acids in the sequence. So this is a very complex task to try to predict what amino acids should be where in a protein.

But to really do this well, you know, the model would start to have to have these kind of hidden variables that are representing the biology that allow it to, you know, look, look at a protein and say, "Okay, what amino acids should be there in all these different contexts?" So you know, that, that's sort of, I think, the intuition.

I mean, I would draw the parallel to language modeling, right? And so I, I guess I was, I was like very influenced by a paper by, um, uh, Zellig Harris. It was called Distributional Structure from, from 1954, and it's, I think a paper that influenced a lot of people in the language modeling field as well, you know.

But I, I think it has-- So it focuses on, on language and, and it really articulates this idea that, um, the set of contexts in which a word appears are determined by the meaning of that word. And so what Zellig Harris kind of imagined is that, you know, it would be, as you looked at the statistical patterns of, you know, what words appear in what context sets, you would be able to derive the meaning of language.

You know, you would, you would have this kind of statistical structure that would mirror the underlying meaning of language. For me at least, that's one of the most convincing explanations for why, you know, a language model that's trained on the text of the internet is going to learn something about meaning. It's gonna learn something deeper and more fundamental.

And so I think, you know, you can think about the same thing in biology, where the contexts in which an amino acid can occur are really determined by, you know, the structure or the function of the protein, its biological roles. You know, these, I mean, very complex phenomena, uh, both the intrinsic biology of the protein and its relation to all of the other proteins and the function and evolution.

And so-- But, but those are what determine the context sets. And so you would imagine that then those statistical patterns in the use of amino acids, they directly reflect those underlying hidden variables. And so the model is gonna learn something about those hidden variables.

Brandon Anderson16:03

I definitely buy that; it seems plausible. Um, maybe just-- I mean, in fact, I'm gonna be clear. I actually do really believe in this direction, but there are a lot of, like, ways I think about this where I could imagine maybe it wouldn't work, and one of them is, like, data availability, right?

Data & Scale16:03

Brandon Anderson16:19

Like, what type of data do we normally, uh, have? What type of sequence data do we normally get? I think that ESMC in particular has some new data sources compared to, like, previous models, which might be helpful. But oftentimes, the type of sequences we have available have, like, a very strong bias towards certain specific needs for medicine or, you know, human biology or disease biology, right?

So it's not necessarily that if you take just a naive data set, you're gonna necessarily get an interesting scaling law. So I'm curious about, like, what in particular was the sort of breakthrough in ESMC. So maybe we can go back a bit and talk about some of the other ESM, you know, predecessors, which got here before ESMC, and how, like, you know, they were, you know, their strengths, but also maybe some of the limitations that ESMC overcame and, like, what the developments there were.

Alex Rives17:11

Yeah.

Brandon Anderson17:12

Sure.

Alex Rives17:12

Well, I'll admit that I am bitter lesson-pilled.

Brandon Anderson17:15

Yeah.

Alex Rives17:15

I am scaling-pilled. And so I do think that, I mean, you know, just kind of increasing the data-

Brandon Anderson17:22

Mm-hmm

Alex Rives17:22

... increasing the parameters and having that compression is, is going to just lead to more powerful models. But you know, it is also true, and I think you're, you're absolutely right, the structure, the underlying kind of structure and distribution of the data is really critical. And so, you know, some data sets will be far more valuable for kind of learning these, these general principles than, than others.

But I think it goes against a lot of biological intuitions about collecting data, I guess is what I'd say, because normally when you think about what data do you want, you're trying to answer a very specific scientific hypothesis. You want, you know, a very well-controlled experiment. You know, you really want multiple replicas. You know, it's, it's something that's very focused, you know, is the way that I would put it.

So I think the change in the way of thinking is to think, okay, what you really want if you want to learn a general representation of proteins, is you want to see amino acids in as many evolutionary contexts as possible. That's really what you want. That's really kind of how I think about data. And I think if you-- what changed between ESM2, which was kind of the previous generation model, and ESMC, which is this new generation model, 'cause they're both at approximately the same scale, and, you know-

Brandon Anderson18:36

Sorry, the same scale, uh, of compute-

Alex Rives18:38

S- same scale-

Brandon Anderson18:39

Or same scale of parameters

Alex Rives18:39

... in scale and parameters.

Brandon Anderson18:40

Okay.

Alex Rives18:41

Yeah. ESM2 got a lot of compute, but-

Brandon Anderson18:44

Mm-hmm

Alex Rives18:44

... ESMC got even more compute.

Brandon Anderson18:45

More, yeah.

Alex Rives18:46

But it's not just the compute. The data was, was really the critical thing here actually. So when we trained ESM2, we observed two things. The, the first was that as we increase the number of parameters and compute, you know, we saw improvement. So we had a model kind of at the billion parameter scale, we had a model at the 10 billion parameter scale, and, you know, the larger scale model is better than the smaller scale model.

But if you kind of look at a plot of, uh, parameter scale, you know, sort of a log plot of parameter scale versus capability, and so for capability, you know, we're looking at kind of the representational fidelity, how well does it capture protein structure? You could see that there's kinda diminishing returns in ESM2. ESM2 is trained on UniRef.

And for ESMC, we added metagenomics. So we added billions more sequences-

Brandon Anderson19:40

So-

Alex Rives19:40

... to the training data.

Brandon Anderson19:41

Yeah. Could you explain what UniRef and, uh, metagenomics means?

Alex Rives19:46

Yeah. Yeah. So U- UniRef is, um, I'd say sort of the, the gold standard dataset-

Brandon Anderson19:51

Mm-hmm

Alex Rives19:51

... of sequence biology. It's kind of, you know, taking sequences from across a wide variety of different sequencing resources. It's clustering them, you know, to kind of re- remove some of this redundancy that, that you were mentioning. And, and so it kind of creates a definitive coverage of, of, of protein biology. What has happened in parallel to the classical gene sequencing is, is this idea of metagenomic sequencing, where people go out into, you know, all kinds of different biomes and environments and collect samples from the world and just kind of sequence the, the natural diversity that's present there.

So, you know, proteins from a hydro- hydrothermal vent or proteins from a, from a, a frigid en- en- environment near the South Pole, you know, or, or the deep ocean, or, you know, soil, or the human gut. You know, all, all kinds of different environments.

Brandon Anderson20:39

So this is a very different way of collecting data. Instead of you are trying to understand a specific genome of a specific organism, or trying to understand a specific protein, you just collect a bunch of stuff, mix it up in a pipe, get the sequences out. You have no idea what organisms these are from. You don't necessarily even know if a given sequence is a protein, but you can guess based upon certain contexts.

And you say, "Okay, we throw these together. These are likely protein sequences we find. We're not assigning them to an organism. We're not assigning them to, like, a larger context. We're just saying, 'This is probably a protein. Let's train on it.'"

Alex Rives21:12

That is right. Yeah, and you, you don't even get the full genomes. You just get these kind of contigs that often are broken and have even partial proteins.

Brandon Anderson21:19

Mm-hmm.

Alex Rives21:19

So the, the data's really noisy. One more, like, little nerdy question that I have here is that if I understand correctly, you're not actually looking at so- using a device that sequences proteins. You're sequencing the DNA that would manufacture those proteins, so you're finding DNA and then looking for markers that indicate the beginning and end of a protein sequence.

Is that kinda- Yeah, that- ... correct? That's exactly right. Yeah. Okay. Basically sequencing, you know, genetic sequences, and then you translate the, the proteins from those sequences. And so you're digging up, like, sewers and-- Not you. But not me personally. That's not you. But there are scientists- But there are somebody who-

Brandon Anderson21:58

Digging in sewers

Alex Rives21:58

... who are doing this ... like, like probably-

Brandon Anderson21:59

Yeah

Alex Rives21:59

... many thousands of people.

Brandon Anderson22:00

The New York City subway.

Alex Rives22:01

Yeah.

Brandon Anderson22:01

You know, all, all kinds of things.

Alex Rives22:02

Yeah. So the natural question to me is, so you built this model and you think that you've kind of de-duplicated it so that you have a good representational set without a lot of redundancy in it. How much more is there? Like, if we had an order of magnitude more resources, do you think that there is an order of magnitude more proteins to discover?

I think so. I, I, I'm not entirely sure, but there are a lot of proteins, and, um, I, I think we've, we've barely scratched the surface of measuring Earth's biodiversity. So there are core proteins that are conserved across all of life. So we, we, I think, know that, right? But, but as you go into these different environments, there's just, you know, new, new genes and no- new proteins constantly being created by evolution.

And this is a lot of, uh, my understanding is a lot of this is viruses and bacteria and other- Yeah, microorganisms ... microorganisms. Yeah. Eukaryotic organisms. And so those, and these guys are in this basically r- long-running conflict with each other that causes them to recombine their DNA in w- ways that help them to survive in these extreme or w- whatever environment.

Yeah, yeah. And so that that's what's causing this incredible diversity of proteins. That's right. Yeah. Yeah. And just four billion years of, of life running experiments in parallel all across the earth in all kinds of different ecological niches, and we just, we see the outcome of all of that. And so the combinatorial, that's why you believe that, yeah, there's, there's going-- Although the, maybe from a macroscopic perspective when we look at it, there's maybe not that m- not even nearly as much diversity as there will be at the microscopic scale because you have this m- incredible combinatorial effect.

Yeah. I mean, there's just, I think, tremendous, tremendous diversity there. So, so g- kinda going back then- Yeah ... to ESMC- Sorry, sort of nerd snipe there, but ... so, so, yeah, no, it's, it's, it's great, right? And I think it's really-- I mean, we could also talk about data and building models of the cell and kind of really going from the molecular level to- Yeah ...

you know, to higher levels of biological complexity. But to complete the description of ESMC, so that's really the big change, was kind of adding these metagenomic sequences. And then, you know, what we saw basically is there are no longer diminishing returns to scale. So that's really saying that ESM2 was kind of data limited rather than compute limited.

For ESMC, there's a really beautiful scaling law that we can plot. You know, to make the larger models, we basically train models at the smaller scale, and we can really look at the best representational fidelity that they can achieve for a given compute budget and just draw a line of extrapolation out that kind of beautifully predicts what the larger scale models will be able to achieve in their representational fidelity.

So there's this really beautiful scaling and, you know, really, the only-- I mean, there are some changes to ESMC just to make it a more efficient model for training. But, you know, I think the data is really the big thing there that's driving that.
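The "draw a line of extrapolation" step can be sketched as a straight-line fit in log-log space. Everything numeric below is synthetic: the compute budgets, the error metric, and the exponent are invented so the example is self-checking, and they are not ESMC results. The assumption is that some error-like measure of representational fidelity follows a power law in compute.

```python
from math import exp, log

def fit_power_law(compute, error):
    """Least-squares fit of log(error) = intercept + slope * log(compute)."""
    xs = [log(c) for c in compute]
    ys = [log(e) for e in error]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

def predict(compute, slope, intercept):
    """Extrapolate the fitted line to a new compute budget."""
    return exp(intercept + slope * log(compute))

# Synthetic small-scale runs generated from error = 10 * compute^-0.1.
small_compute = [1e18, 1e19, 1e20]
small_error = [10.0 * c ** -0.1 for c in small_compute]

slope, intercept = fit_power_law(small_compute, small_error)
predicted_large = predict(1e22, slope, intercept)  # two orders of magnitude out
```

If the larger model, once trained, lands on the extrapolated line, the small runs were a reliable guide. Diminishing returns, as described for ESM2, would show up as the large model falling short of that line.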

Brandon Anderson25:18

So it still is basically just a standard vanilla transformer, a few tricks. Everyone has-

Alex Rives25:22

It is

Brandon Anderson25:22

... a few tricks at this point.

Alex Rives25:23

Yeah. That's-

Brandon Anderson25:24

Masked language model and just a lot of data. Yeah.

Alex Rives25:27

So I mean, this is very much in contrast to something like AlphaFold, right?

Brandon Anderson25:30

Yeah.

Alex Rives25:30

Where you have a lot of inductive bias-

Brandon Anderson25:32

Mm-hmm

Alex Rives25:33

... in, built into the model in order to be able to predict protein structure. That's right, and the idea here, here is, you know, really can we just learn the right structure? You know?

Brandon Anderson25:43

Mm-hmm.

Alex Rives25:43

Don't, don't give any priors-

Brandon Anderson25:44

Mm-hmm

Alex Rives25:44

... just allow, you know, allow machine learning to figure out what that structure is.

Brandon Anderson25:49

So you also had your own detour into priors with ESM3, right? Like, or maybe not priors, but, uh, using more intuition or more human design. Do you think ESM3 was a detour? Do you think there was... I mean, did you just end up saying like, "Okay, let's make C bigger," and then suddenly it worked and now you learned that actually we don't need priors anymore?

Programmability25:49

Brandon Anderson26:12

Is that like kind of a key insight, or do you still think there's room for priors?

Alex Rives26:17

I think we need both. I mean, I think there's a s- you know, there, there's a place for both of them. So, you know, the, the goal for ESM3 was to really make biology programmable, and so we're trying to think-

Brandon Anderson26:26

Mm-hmm

Alex Rives26:27

... okay, like, what is the programming language, right? How are you going to be able to allow biologists to, to prompt a model and-

Brandon Anderson26:34

Mm-hmm

Alex Rives26:34

... design structure and design function and all these things? And so we really thought it needed the right tracks.

Brandon Anderson26:40

Mm-hmm.

Alex Rives26:40

But, but I would say that ESM3 was, like, very consistent with the philosophy of ESM because, you know, what we did is we basically predicted structures for this vast array of evolutionarily diverse proteins and, and we're using that as the training data. So the model's now just, it's learning seq- from sequence patterns, learning from structural patterns, learning from functional patterns.

Brandon Anderson27:01

Mm-hmm.

Alex Rives27:01

But I think that same kind of synthesis that the model is learning on sequences, you could imagine that, you know, bringing in more multidimensional information would, would build an even better representation space.

Brandon Anderson27:13

If you are a coder, or if you're building, uh, language models and then building coding agents, you start with pre-training on everything, and then you go to doing the programming part by some sort of post-training, probably RL. I mean, have you thought about post-training, uh, ESMC to try to give you the same abilities for programmability? Do you think you could get programmability without doing all of the inductive biases, which involves, like, an atlas of structures and, you know, some sort of interesting distillation? But, uh, I guess maybe that is in some sense kind of post-training, um, of a different model.

Yeah.

Alex Rives27:48

Yeah, I mean, I, I think it's a really in- interesting question, kind of to what degree can you interconvert these-

Brandon Anderson27:53

Mm-hmm

Alex Rives27:54

... models? I don't think kind of that's fully understood yet.

Brandon Anderson27:57

Mm-hmm.

Alex Rives27:57

But I, I think that's, it's a very kind of promising direction to think about-

Brandon Anderson28:01

Yeah

Alex Rives28:01

... um, doing that and what are the right ways to do that. So ESMC is al- is also approaching programmable biology, but I would say in a very different way. It's approaching it from this kind of world modeling perspective-

Brandon Anderson28:14

Mm-hmm

Alex Rives28:14

... where the idea is basically you have a predictive model and, you know, y- you're gonna search the world model to find protein molecules that, uh, that satisfy kind of whatever-

Alex Rives28:24

... design criteria that you have. So we've been able to use this to actually now go and design, um, many protein binders.
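To make the "search the world model" idea concrete, here is a minimal sketch under loud assumptions: `toy_score` is a hypothetical stand-in for a predictive model's score and has nothing to do with ESMC's actual interface, and simple hill climbing is swapped in for whatever search procedure the team really uses. It only shows the shape of the loop: propose a sequence, score it with the predictor, keep it if it meets the design criterion at least as well.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def toy_score(sequence: str) -> float:
    """Hypothetical design criterion (NOT a real model): count
    hydrophobic residues sitting at even positions."""
    return sum(1.0 for i, aa in enumerate(sequence)
               if i % 2 == 0 and aa in "AILMFVW")

def search(score, length=12, steps=500, seed=0):
    """Hill climbing: mutate one position at a time and keep the
    mutant whenever the predictor scores it at least as high."""
    rng = random.Random(seed)
    best = "".join(rng.choice(AMINO_ACIDS) for _ in range(length))
    best_score = score(best)
    for _ in range(steps):
        pos = rng.randrange(length)
        candidate = best[:pos] + rng.choice(AMINO_ACIDS) + best[pos + 1:]
        candidate_score = score(candidate)
        if candidate_score >= best_score:
            best, best_score = candidate, candidate_score
    return best, best_score

designed, designed_score = search(toy_score)
```

In a real setting the scoring function would be the expensive part, and the search strategy would matter far more than it does in this toy.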

Brandon Anderson28:31

Right. Mm-hmm.

Alex Rives28:31

But I think sort of most excitingly-

Alex Rives28:33

... we've been able to use this to actually design antibodies, scFvs-

Brandon Anderson28:38

Mm. Mm

Alex Rives28:38

... and we're seeing really, I think, exciting-

Brandon Anderson28:40

Mm-hmm

Alex Rives28:40

... uh, success rates in a, in a-

Alex Rives28:42

... small number of trials now.

Brandon Anderson28:43

Yeah. Yeah. So can, can you explain what those scFvs are?

Alex Rives28:48

Yeah, yeah. So an scFv is basically, um, it's a, it's a single-chain antibody.

Brandon Anderson28:53

Mm-hmm.

Alex Rives28:53

So it's, it's, um, a, a kind of therapeutic modality that basically has... So an antibody has, uh, a heavy chain and a light chain.

Brandon Anderson29:02

Mm-hmm.

Alex Rives29:02

And then it basically has a pair of, uh, you know, one heavy chain, one light chain, another heavy chain, and one light chain that, that come together to recognize a target. So there's different variations of, of these kind of modalities that are used therapeutically. And so what's interesting about the scFv is it has one heavy chain and one light chain, so it is able to kind of form these very complex binding interfaces where, you know, you can kind of have two different subunits coming together to engage a target.

These are kind of an important therapeutic modality, um, something like tw- I think a quarter of, of, of new drugs are, are antibodies, so it's really, I think, you know, one, one of the, one of the critical, um, modalities for, for medicine. And basically what we're able to see is that, you know, you can search ESMC and you can actually find, um, antibodies that are reaching the level of affinity, they're, I should say, are really at the level-

Alex Rives29:56

... of affinity that is needed for therapeutic function and activity.

Brandon Anderson30:01

The pr- protein design space has kind of exploded in the last five years. You know, everyone is doing protein design, pretty... You know, many people are excited about protein design. Uh, my kind of high level naive understanding of the field is that things like mini binders, um, are, are quite doable. People have done that, you know, quite routinely successfully.

You know, in smaller-- By the time you get to like nanobodies and scFvs, they're a little bit harder to design. Um, and then antibodies are still actually quite out of reach oftentimes. One of the common reasons, you know, for this is if you're in the AlphaFold paradigm, you don't have MSAs, right? The evolutionary pressure for antibodies is actually the opposite in many ways of what the evolutionary pressure is for everything else.

They go for diversity rather than trying to be, w- go, uh, evolve along a very like constrained path. So I'm curious, h- did you try larger structures, and is that something that you've seen success on, or is this something that you still think for some reason it might be hard to do?

Alex Rives31:06

Well, you can actually take the, um, scFvs and reformat them-

Brandon Anderson31:09

Yeah. Yeah

Alex Rives31:10

... as, as antibodies. So I think that-

Brandon Anderson31:11

Yeah

Alex Rives31:11

... would be kind of the quickest approach to do that. Um, we've not tried full IgGs. I, I don't see-

Brandon Anderson31:17

Mm-hmm

Alex Rives31:17

... any reason why that wouldn't work.

Brandon Anderson31:19

Yeah.

Alex Rives31:19

Actually, it's something we haven't yet.

Brandon Anderson31:21

Yeah.

Alex Rives31:21

You know, we, we've decided we're basically kind of releasing this now-

Brandon Anderson31:24

Mm-hmm

Alex Rives31:24

... because we feel like it's, it's kind of reached a point where, you know, we're, we're seeing I think a really, a significant step above kind of what's been possible in the past.

Brandon Anderson31:32

Mm-hmm.

Alex Rives31:33

And so we just, we wanted to get it out there, but-

Brandon Anderson31:35

Yeah

Alex Rives31:35

... you know, I, I think there's a lot more progress that's possible. So we're, you know, we have-

Brandon Anderson31:39

Oh, yeah

Alex Rives31:39

... collaborations to kind of look at some of the other-

Brandon Anderson31:41

Yeah, yeah

Alex Rives31:41

... applications here. You know, the thing about it, right?

Brandon Anderson31:43

Mm-hmm.

Alex Rives31:43

Is it, it's a general model. So I, I think to me that's the most exciting thing about it, is just, you know, a general model for protein sequence, structure, and function.

Brandon Anderson31:51

Yeah. Mm-hmm.

Alex Rives31:51

You can search it and, you know, therapeutic design basically emerges from that search.

Brandon Anderson31:57

Yeah. Mm-hmm. Yeah. I mean, to me the, the, the, you mentioned that you're not using MSAs, multiple sequence alignments, which was one of the, or maybe the critical insight that allowed AlphaFold to work really well. And the fact that you didn't need that in order to m- make it work basically as well as AlphaFold3 is really exciting to me, because that means that your thesis of let's, let's cover the space of possible proteins as well as we can and see what the emergent behaviors are, so that if this is an emergent behavior, that we're kind of able to replicate what happens when we have used multiple sequence alignment, what are the other things that maybe we don't have data for, but that we are able to also do in an emergent way?

Alex Rives32:44

I would say, actually, you know, we're, we're doing significantly better on, on antibodies, so I think, I think that's one of the things that's really cool. That's one of the theses that, that we had, is, you know, antibodies are not gonna benefit from, um, evolutionary information probably in the same way that kind of predicting the structural topology of, uh, of a molecule will.

So, you know, I think, I think you kind of see that now where the, the representation space is containing something that's really interesting about antibodies here.

Virtual Cell33:11

Brandon Anderson33:11

I want to talk about, 'cause you mentioned something very interesting to me, which was talking about virtual cell and how this maybe interfaces or d- this work here. I'm really interested to know, were you able to find other things in your mechanistic interpretability? What were some interesting things that weren't just validating biology, but there's a pattern that was unexpected?

Did you find anything like that?

Alex Rives33:34

It's complicated. So because w- we have to, we have to now actually go and validate some of these things, right?

Brandon Anderson33:39

Sure, yeah.

Alex Rives33:39

So I think what we saw are, like, interesting connections, right?

Brandon Anderson33:42

Mm-hmm.

Alex Rives33:42

So, um, you know, what we can see, for example, is that kind of distantly evolutionarily related gene editing systems clustered together in this space in ways that are consistent with and kind of reflect our knowledge of the origin of those gene editing systems. So that's really exciting. But, but kind of the thing is, right, there's a number of proteins that are in that map that are kind of brought together in different ways where we just, we don't know what they are right now.

We don't know what they do. So one hypothesis there is, well, these are kind of novel gene editing systems. I think in this atlas, you know, there's, there's gonna be some, some really interesting basis for scientific discovery there. And if you think about kind of how people go out and look for new gene editing systems, for example, they're typically mining the large genetic sequence databases, and they're looking for kind of different sequence patterns or structural patterns that are linked to that.

Actually, the first version of the ESM Atlas was, was used by Feng Zhang's group to find, um, a new gene editing system. So I, I think there's just a lot of biology out there that we don't understand that's waiting to be discovered and kind of being able to connect the dots between proteins so that we can go from, you know, what it is that we, we know today to, to kind of make those inferences about the unknown.

So that, that, that's what I'm excited about. And I think, you know, there are proteins for, for so many applications that nature has probably invented. You know, you think about the thermostable polymerase, which enables PCR, which came from a, a bacteria living in a thermal hot pool. You know, you, you have, there may be the solution to, to climate change, you know, somewhere in, in protein biology.

There are probably all kinds of building blocks for, for completely green chemistry infrastructure out there. There's probably new medicines and therapies, you know, but the, the question is, how do you find those? And so I think, you know, kind of being able to connect the dots is, is, is really one way to, to start to be able to open up that space of protein biology to discovery.

Brandon Anderson35:44

I'm curious, uh, one of the advancements of ESMC is an improvement in multimer, so basically protein-protein interactions, like, um, that structure prediction. The ability to predict the way two proteins interact, I think you now claim to do better than anyone else, right? Correct me if I'm wrong.

Alex Rives36:02

Yeah, I mean, I think we're state of, state-of-the-art-

Brandon Anderson36:04

Yeah

Alex Rives36:04

... for protein models, yeah.

Brandon Anderson36:04

Okay. One thing which I know some people would find very useful for virtual cell is just an entire mapping of every single pair of proteins inside the human transcriptome. Have you thought about doing this in terms of, um, like kind of a beginning to a virtual cell, like create that map?

Alex Rives36:23

So I, I think something like that-

Brandon Anderson36:24

Yeah

Alex Rives36:24

... would be really valuable.

Brandon Anderson36:25

Mm-hmm.

Alex Rives36:25

I think, you know, fast. So the other thing about ESMFold 2 is it's a really fast model because it doesn't-

Brandon Anderson36:31

Yeah

Alex Rives36:31

... require the multiple sequence alignment.

Brandon Anderson36:33

Mm-hmm.

Alex Rives36:33

So, you know, you can do inference kind of, you know, directly from the sequence. Um, it takes seconds. You know, you can get an atomic resolution-

Alex Rives36:41

... prediction. So yeah, that, that's I think one really interesting application. At, at BioHub, I mean, the other thing that we're thinking about is can we actually experimentally resolve this?

Brandon Anderson36:51

Mm-hmm.

Alex Rives36:52

And so one of the things that we are, we are building is cryo-electron tomography, and, and we're really building systems that can greatly increase the contrast when you're looking at, you know, at the cell at the atomic level. And so I, I think one thing that I, I hope to see is actually a structurally, empirically resolved interactome at some point in the future.

And I think there are some, some pretty big technical hurdles and, and technologies that have to be developed to, to overcome that. But I think that's something that, that's going to be possible. So we c- we can use computational methods to start to get the proxy of that, and I think, you know, that's gonna be really powerful.

But I, I think a lot of the future of structure prediction is gonna turn into structure determination, actually. You know, really bringing together these kind of tools that we have for modeling proteins and bringing them together with experimental data so that we can start to, you know, develop this picture that's, uh, you know, informed by empirical biology, by, by what we can observe.

RJ Haneke37:54

So is that the vision here, if I'm understanding correctly, is that you have maybe lab in the loop kind of thing, where you have an agent that is talking to your s- you know, C7 and whatever, and then it predicts a property that you're interested in. It sequences the, the, the genome, or it creates the genome, it creates the protein from the genome.

It-- then it observes it with some version of this, uh, microscope. What, what did you call the microscope again?

Alex Rives38:24

It's a cryo-electron tomograph.

RJ Haneke38:26

Okay, okay. And then you do whatever experiments, or you observe it, and then you use this as a lab in the loop to, "Oh, okay, this folds this way, therefore, I want to check... The next one that I want to check is actually a different one," and use an active learning system. Is that sort of the vision that you're articulating here?

Future Paradigm38:43

Alex Rives38:43

Well, I, I think there are gonna be, there are gonna be a few fundamental principles for the next era of, of biology. And I think, I think it's, yeah, it's such an interesting time right now because I think we're really, we're at the beginning of a new scientific paradigm. It's really just the beginning of it. And so what is, you know, what is defining in that paradigm, right?

And so I think there are, there are a few principles. You know, one is scaled data generation. I think that's gonna be really critical. The second is computational, you know, predictive digital representations of biology, and we can kind of talk about, you know, you can think of ESM as, as being, you know, first generation, AlphaFold as being a first generation of those kinds of approaches.

And so you can kind of start to think about what does that look like as we can model more and more biological complexity in that way. And then you have the principle of feedback, and you have the principle of, you know, we have intelligence now that's scalable and so can be applied to every unit of a biological problem.

What would it mean for all of that to come together? So I think we're gonna, we're gonna have increasingly capable and accurate digital representations of molecules, genomes, cells, ultimately physiology. That's where you want to get. We're gonna have to, have to go up that, that complexity scale, the levels of, of biological complexity. That requires traversing a, a data barrier.

There's, I think, data that, that does not exist that needs to be generated to achieve that level of predictive fidelity. And then we're going to have reasoning. And I think, you know, what that will mean is that we can reason over thousands, millions, hundreds of millions of scientific hypotheses in parallel digitally using predictive oracles, which can, you know, actually predict the outcome of an experiment.

So the scale that we can ask questions and the kinds of questions that we can ask will just fundamentally change through that. Feedback is gonna be critical. You know, the models are gonna need to-- There's gonna be sort of a scaling dimension of this, which is, which is, you know, building the data to have those accurate representations, and then a feedback dimension where the models can learn from biology, can reason digitally, can reduce that to a small number of experimental hypotheses, examine the outcome of each of those experiments, update, uh, their, their, their understanding, and, and build knowledge in that way.
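The feedback principle described here (reason digitally over many hypotheses, reduce them to a few experiments, examine the outcomes, update) can be sketched as a small loop. Everything in this sketch is an assumption for illustration: `nearest_neighbor_predict` is a toy surrogate and `toy_experiment` a hypothetical stand-in for a real measurement; neither reflects anything BioHub has described building.

```python
def nearest_neighbor_predict(candidate, observations):
    """Toy surrogate model: reuse the measured value of the closest
    candidate that has already been tested (0.0 before any data)."""
    if not observations:
        return 0.0
    nearest = min(observations, key=lambda tested: abs(tested - candidate))
    return observations[nearest]

def toy_experiment(candidate):
    """Hypothetical stand-in for a lab measurement; peaks at candidate 7."""
    return -float((candidate - 7) ** 2)

def feedback_loop(candidates, predict, run_experiment, rounds=3, batch_size=2):
    """Each round: rank untested candidates with the current model,
    run a small batch of experiments, and fold the results back in."""
    observations = {}
    for _ in range(rounds):
        untested = [c for c in candidates if c not in observations]
        if not untested:
            break
        ranked = sorted(untested,
                        key=lambda c: predict(c, observations),
                        reverse=True)
        for chosen in ranked[:batch_size]:
            observations[chosen] = run_experiment(chosen)
    return observations

results = feedback_loop(range(10), nearest_neighbor_predict, toy_experiment)
```

The point of the sketch is the division of labor: the cheap predictor is consulted for every candidate, while the expensive experiment runs only on the handful it ranks highest.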

So I think that's what's, what it's gonna look like, and, and we kind of have to build each of those components. What BioHub is really trying to do is to kind of bring together the experimental and the technology layer that will actually allow us to have these AI models interact with the biology and do experiments. And I think it's, you know, it's, it's Amdahl's law.

You know, we see incredible, incredible advance in areas where we can get feedback computationally, um, so in closed domains. But of course, you know, experimental biology is, is, is completely open-ended, and so the feedback principle there is, is going to be very different. But, you know, something, there's gonna be something like RLVR, you know, with, with experiments where we can, you know, have models that are, that are just really building knowledge and learning from that knowledge and being able to develop more and more accurate representations.
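For reference, Amdahl's law says that if a fraction p of a process is sped up by a factor s, the overall speedup is 1 / ((1 - p) + p / s). The sketch below simply evaluates that formula; reading the computational part of science as the accelerated fraction and the experimental part as the remainder is an interpretation of the remark above, and the function name is made up for illustration.

```python
def amdahl_speedup(accelerated_fraction: float, factor: float) -> float:
    """Overall speedup when `accelerated_fraction` of the work gets
    `factor` times faster and the rest runs at its original speed."""
    remaining = 1.0 - accelerated_fraction
    return 1.0 / (remaining + accelerated_fraction / factor)

# Even with a near-infinite speedup on 90% of the work, the untouched
# 10% caps the overall gain just below 10x.
capped = amdahl_speedup(0.9, 1e12)
```

That cap is the reason the slow, open-ended experimental step ends up setting the pace once the computational steps become fast.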

Brandon Anderson42:16

You're the head of science at BioHub. Maybe fun fact for those who don't know, the science section of Latent Space was, um, basically launched after or in response to Mark Zuckerberg and Priscilla Chan being on this podcast about six months ago. It's actually very exciting to have you here and kinda come full circle, and, uh, Mark laid out quite an ambitious vision for what BioHub wants to accomplish, and I think you just laid out a very natural, you know, successor to that.

BioHub Mission42:16

Brandon Anderson42:45

I think you had just joined at, like, you were, like, there two weeks.

Alex Rives42:48

I joined-

Brandon Anderson42:49

Finally.

Alex Rives42:49

Yeah, yeah.

Brandon Anderson42:49

Yeah.

Alex Rives42:50

I think at the very end of October.

Brandon Anderson42:51

Yeah, yeah, yeah.

Alex Rives42:51

And launched at the beginning of November.

Brandon Anderson42:52

Yeah, yeah. One thing I'm curious about is, in your eyes, you know, where is BioHub now? Like, what, what do you want to, you know, accomplish? What are your goals, big picture goals? For listeners who haven't watched, um, that, the episode with, you know, Mark and Priscilla, um, we recommend, of course, that they go watch it. Link in the description.

And then have you learned anything, even in just the short time of six months you've been here? And, like, has the vision evolved? And, you know, where, where, where do you see this going? Um, I think we can, you know, how does ESMC fit into this? Uh, how does, you know, the virtual biology initiative that you recently announced f-fit into this?

And then I think there's, like, several other things that you're working on that we haven't even touched on.

Alex Rives43:33

Yeah.

Brandon Anderson43:33

Yeah.

Alex Rives43:33

Well, I, I'm learning things every single day.

Brandon Anderson43:35

Yeah.

Alex Rives43:36

But the way I think about it- We're building a scientific institution for this new paradigm. And, you know, to do that, you know, it's, it's, it's an institution that's going to be powered by frontier experimental biology-

Brandon Anderson43:51

Mm-hmm

Alex Rives43:51

... frontier technology for, for measurement, for observation, and it's going to be powered by frontier artificial intelligence. You know-

Brandon Anderson43:59

And this is all open source, right?

Alex Rives44:01

It's a philanthropy. So-

Brandon Anderson44:02

Mm-hmm

Alex Rives44:02

... so our goal is to accelerate science. Our mission is to cure or prevent disease. And so to do that, you know, our belief is that there's a fundamental gap in our understanding, and we need to accelerate science to traverse that gap. And so we're really thinking about every layer of biological understanding that goes from the most basic, you know, level, like the, you know, the, the atoms of a protein in a cell, all the way to systems of cells in physiology and disease, and how can we create models that can capture that complexity, can allow us to understand that complexity?

And I think, you know, if you think what, what is, you know, what does the cure to disease look like, right? It's, it's not, it's not a pill, right? It's not a medicine in the conventional sense. You know, it's, it's going to have to be, um, a system that is capable of modeling and understanding, you know, the underlying physiology of disease in a way that's differentiated for every single human being, for every single different genome, and it's going to have to be able to link events all the way at the molecular scale to the manifestation of disease in, in physiology.

So it's, it's an incredibly complex, incredibly hard problem. And for us, you know, we're trying to ladder up those layers of complexity, and we're trying to build kind of the foundational tools that scientists can use to, you know, answer the fundamental questions there. And so we're creating atomic level imaging. We're creating light-sheet microscopy that allows us to observe, you know, how, how all the cells move and, and develop in a, in a developing organism.

We're creating spatially and temporally resolved maps of, of inflammation. We're creating, you know, cellular, um, programming and immune cell reprogramming to be able to actually design completely programmable therapies. And you know, we're creating these digital representations at each of these layers so that we can accelerate the science, simulate what's happening, you know, make biological matter and make proteins and cells and genomes programmable.

And I think all of that has to, has to come together, and I think if you have the focus and you build, you know, the biology and the, the computational layers together so that they're tightly integrated, you know, that's how we're gonna make the fastest progress. For the last 10 years, I think we've been one of the, one of the big champions of open science.

You know, we're, we're an organization that does-- we both fund and we, and, and we build. And in our funding we've, we've always supported open science, and in our building, you know, we've always done open science. So that's, that's something that's gonna continue. It's, it's just really fundamental. We're not a drug development company.

Brandon Anderson46:52

Mm-hmm.

Alex Rives46:52

We're not trying to generate therapies. We're trying to, trying to build the technology that, that, that moves science forward.

Brandon Anderson46:59

So I think Mark had this concept about if you provide the right tools, then the entire scientific community can leverage them. Yeah, so obviously you believe strongly in protein language modeling as a tool. What is the next most important tool for advancing a general improvement in our ability to tackle human disease?

Cell Complexity46:59

Alex Rives47:20

Yeah. So I, I think the next level of complexity that we have to address is, is the complexity of the cell.

Brandon Anderson47:26

Mm.

Alex Rives47:26

And I mean, this is going to be tremendously hard. There are billions of proteins in a cell-

Brandon Anderson47:31

I'm glad that you say it's tremendously hard.

Alex Rives47:34

Yeah.

Brandon Anderson47:34

If you, you come and say it's gonna be easy-peasy-

Alex Rives47:36

Yeah

Brandon Anderson47:36

... I think we would just

Alex Rives47:37

Well, I, I think it's a, it's a worthy challenge.

Brandon Anderson47:40

Yeah.

Alex Rives47:40

But it, yeah, I mean, it requires technology that doesn't exist today. It requires new, you know, modeling approaches and, and arc- probably architectures and ideas and machine learning-

Alex Rives47:51

... that probably don't, don't yet exist. So there's, I think, deep and fundamental problems to solve. But again, I think, you know, you, you, you, you take it step by step. And so, you know, we kind of started the molecular layer, and we know that that is really fundamental, and we can begin to link that to, you know, observables in-

Alex Rives48:07

... in cellular biology.

Brandon Anderson48:08

I'm really curious 'cause this has been the question that's been on my mind for a long time about, we have virtual cell models, we have molecular scale models, and there's been a few-- I've seen a few papers about trying to link them. But what, what are you guys doing? 'Cause it sounds like this is becoming top of mind for you.

Alex Rives48:28

So let's maybe make the analogy with protein biology. You know, what I think makes our digital representations of proteins powerful and useful is that they generalize. They're able to make predictions for proteins that are entirely unlike the proteins in their training data. You know, they're able to generalize so that you can design, you know, fundamentally new folds, new binding interfaces, new structures.

So there's, there's this degree of, of, of, um, yeah, what, what we call generalization or generality. Um, the-- in, in short, they can predict the outcome of an experiment that we haven't already made, that they haven't already been trained on. So for, for digital representations to be valuable, you know, they've got to be able to be used to answer a new question.

So I think that's the critical thing. So I don't-- we're not there with cells. I think with, with, y- you know, kind of the current generation of models that are, that are being called virtual cells, they are good representations of the underlying data, but, you know, they have a very limited ability to predict what will happen when you make a n-novel intervention in a novel unobserved context.

But to be able to answer the fundamental scientific questions about cellular biology, we need a model that can do that. So, so kind of, you know, our thinking about this starts, starts with that idea is what's it gonna take to, to get there?

Brandon Anderson49:50

Going back to the protein-protein interaction, the human interactome of the... If you had that, just, you know, predicting static structures, static structures are in some sense not enough for a lot of understanding biology. Ah, dynamics are probably, for most people, a much more useful tool to have. You can start with static, it can give you some insight, but it's, it's very rarely the full answer.

So you have, you know, a model which is capable of predicting a lot of different proteins. We probably have almost all. We have, ah, many of these resolved in the PDB, ah, some of them we don't. Given the dynamics and interactions are being, are more important, how do you bridge that gap? Because to me, that seems like maybe one of the, the key steps in going from like a really microscopic model of things to going to something which is closer to a virtual cell.

You actually have to be able to model local interactions of local, ah, proteins or RNA or DNA or lipids or, you know, whatever else is floating in the cell. Is, is that sort of like a, a goal that you would try to bridge? Or maybe I'm misunderstanding. Is there like another way you would imagine bridging these two?

Alex Rives51:04

I mean, one, one day it'll probably be possible to have a computer that can kind of simulate the cell from first principles, but we're very far from that, right? I, I think, I think that's far beyond reach of, of current computational technology. I mean, kind of even simulating the, the physics of the folding of an-

Brandon Anderson51:22

Mm-hmm

Alex Rives51:22

... of a single, you know, protein molecule. Basically, we could do it for a fast-folding, a few fast-folding proteins-

Brandon Anderson51:28

Yeah. We just can't even really do it for one

Alex Rives51:29

... but that's about it.

Brandon Anderson51:29

Yeah, yeah, yeah.

Alex Rives51:30

Yeah. So there's kind of this dual view of biology, this dual complementary view of biology. One, one view is, is kind of that kind of first principles reduction, you know, where, where kind of all of biology is explainable in, you know, more basic terms, in, in, in basic physical, chemical, biochemical terms. And I think historically there's a long research, line of research that's really sought to, to understand biological phenomena and simulate biological phenomena in that way.

And I mean, I think historically, you know, the field had believed that the solution to the protein folding problem or the protein structure prediction problem would come from, you know, this kind of first principles simulation. It really kind of came out of nowhere that, you know, this could be solved using essentially pattern recognition or, you know, this type of machine learning approach.

So I think, I think historically it has been productive to understand biology through information theory, through information. And, you know, you can think about the cell as, as a computer, as an information processing machine. It's, in informational terms, there are, you know, these, these very basic principles that link the information encoded in the genome to the genes that are transcribed to the phenotypes of the cell that will result.

And so, you know, if we could model and understand the cell at the level of its underlying programs, you know, that, that sort of gives, I think, the right abstraction. What do I mean by the right abstraction? Well, I think, I mean the abstraction that is possible today because we're in the era of information theory at scale.

You know, we're, we're in the-- Cla- Claude Shannon, you know, had this idea of, of kind of the, the ideal predictor of the next character, and he, he had this really beautiful paper where he tried to compute the entropy of the English language and imagine basically, you know, taking an infinite context and then, you know, what is the entropy of the next character?

And at that time, I mean, I'd say it took a great leap of imagination to imagine that ideal predictor. But today we get closer and closer to being able to, to build that, and we can do that for text. And so, you know, what would that predictor be for biology?
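As a rough illustration of the quantity being discussed, the entropy of the next character given its context, here is a small sketch that estimates it from raw counts in a text sample. The function name is made up, and this plug-in count estimate is not a reproduction of Shannon's own procedure; it also becomes unreliable (biased low) as the context grows on a small sample, which is part of why the infinite-context ideal predictor was so hard to imagine building.

```python
import math
from collections import Counter

def conditional_entropy(text: str, context: int) -> float:
    """Empirical entropy, in bits, of the next character given the
    previous `context` characters, estimated from counts in `text`."""
    context_counts = Counter()
    pair_counts = Counter()
    for i in range(context, len(text)):
        ctx = text[i - context:i]
        context_counts[ctx] += 1
        pair_counts[(ctx, text[i])] += 1
    total = sum(pair_counts.values())
    entropy = 0.0
    for (ctx, _char), count in pair_counts.items():
        p_joint = count / total                  # P(context, next char)
        p_next = count / context_counts[ctx]     # P(next char | context)
        entropy -= p_joint * math.log2(p_next)
    return entropy

sample = "abababab"
no_context = conditional_entropy(sample, 0)   # two equally likely symbols
one_char = conditional_entropy(sample, 1)     # next symbol fully determined
```

On this toy string, one character of context removes all uncertainty about the next one; natural language, and presumably biology, needs far longer contexts before the entropy stops dropping.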

And that's kind of the idea of, of ESM. It would learn kind of the, the, the underlying structure of, of all biological phenomena. So if you think about that from the standpoint of the cell, if we can collect enough outputs of cellular biology that we can observe to reveal the underlying programs, patterns, and structure, you know, then we could create kind of the information theoretic description of the cell, and I think that would be sufficient to understanding disease.

Brandon Anderson54:21

This is, reminds me of the, a lot of the work that happens in signaling pathways right now, right? Where you have protein in a cascade of different protein-protein interactions that eventually cause a phenotypic change in the cell in some way. How do you translate that into something that can be sort of scaled into a, or maybe it's something else, but how do you, for example that?

Alex Rives54:42

Yeah, going back to the bitter lesson.

Data Generation54:44

Brandon Anderson54:44

Yeah. Going back, let's, let's just get back to the bitter lesson.

Alex Rives54:46

We, we need data. I mean, I think, you know, why have these advances in protein biology been possible? They've been possible because of, you know, decades, I mean, for, for protein structure, half a century-

Alex Rives54:58

... of, of, of work to experimentally determine the structure of proteins and, you know, and, and the, the, you know, the effort across, you know, the scientific world to, you know, sequence genomes and metagenomes. And so that's created, you know, this, this data set that, you know, you, you, you can really, you know, train at scale on and really learn these, these deeper principles.

And so-

Brandon Anderson55:21

But those two different data sets are actually in many ways quite different. Like PDB, a bunch of, a bunch of very painstakingly-

Alex Rives55:28

Yeah. Yeah

Brandon Anderson55:28

... constructed protein structures, many of which were an individual PhD thesis, and then maybe the similar follow-on ones came later, which might have been, you know, ten of them for a PhD thesis. People always estimate it's like thirteen billion dollars to create the PDB, some very large number. The reason people created the PDB was because each individual protein was independently useful.

Like, people didn't create it for the sake of solving protein structure. They saw that like, "Oh, this protein, we believe, is involved in this disease pathway. Let's understand this protein so we can target it," and so on. Of course, there's some caveats here, but at a high level, a lot of this genomic data, especially for humans or viruses or bacteria, was sequenced for a very specific reason as well, right?

Well, it's great that these are useful after the fact, but I wonder if now, going forward, especially with BioHub's Virtual Biology Initiative of like half a billion dollars, I think, and I'm sure there will be more large initiatives coming from BioHub in the future, you have the chance to be very specific and deliberate, and now collect data for the sake of solving a problem with ML, rather than depending on a data set which was created for some other purpose.

So given that new opportunity, how do you do things differently? How do you think about data collection, um, to enable science broadly when you have the option of doing basically anything from first principles?

Alex Rives57:03

A little bit of context. We announced a few weeks ago the Virtual Biology Initiative. We basically said, you know, we're gonna invest four hundred million internally in data creation and development of technology to scale data generation, to be able to increase the number of modalities that we can measure simultaneously. We also announced that we're going to commit a hundred million to catalyzing efforts outside of BioHub to generate data.

And so, you know, we think that's a fraction of what's actually needed to do this, right? But the hope basically is that by making this initial commitment, kind of giving starting funds to some of the groups that are really thinking about this, working to build different core areas of the data that's going to be needed, that's gonna be a catalyst that's gonna galvanize other groups to come in and contribute to this.

So that's what we really hope to see. You know, the idea is that this is a broad-based effort, so it's not just us. So I can say kind of what my perspective is on what data needs to be generated here or what can be generated. But, you know, we also want to approach this really collaboratively with the scientific community.

And so-

Brandon Anderson58:18

Mm-hmm

Alex Rives58:18

... part of this is also kind of hearing from scientists what they want. So from my view, you know, there are a few key principles here. The first is speed, okay? So, you know, it took decades to build the data for proteins, and we can't wait decades. We need to figure out how to do this in a couple of years.

And you look at the rate that general AI is developing, and, you know, in biology we're gonna be fundamentally limited by experimental science and data. And so we really need to work to address that gap as quickly as possible. So I think that's one key thing: looking at what are the technologies that we can scale up today to begin to give this picture of the information architecture of the cell.

So there's speed, and then there's also, uh, the idea of generalization. So kinda going back to what I was saying before, you know, we want models that can serve as oracles for the biology. They can predict an experiment that you haven't done. And so how are we going to be able to do that? We're gonna need to look at a multitude of different interventions in a multitude of different contexts.

And so it's kind of similar to the principle of training a language model on the internet or training a protein language model across all of evolutionary diversity. What does that look like for cellular biology? And so we have to scale interventional biology, so that looks like things like perturbation biology, Perturb-seq, measurements where we can look at combined transcription, imaging, and other layers of the cellular information hierarchy.

And there are, you know, a number of groups. Our teams are working on this. There are a number of groups across the scientific world that are working on problems like this that are ready, I think, to scale. The second is spatial biology, and I think that's gonna be really important. And so that's gonna help us to really understand the cell in context.

And I think, you know, understanding the cell in isolation is, is really not what we need. It's not the goal, right? The cell is part of an incredibly complex system in the body. And, you know, to be able to understand disease, we have to understand how cells interact, the systems that they form, the circuits that they form.

So we need to see that. So spatial biology, I think, is undergoing rapid progress and, you know, is an area that's really ready to scale up. That's kind of what can scale now, I think. And, um, BioHub has actually, over the last ten years, really, I think, made kind of pioneering funding commitments in those areas.

And so we've funded efforts like the Human Cell Atlas, and we've built Tabula Sapiens, which built large cell atlases, and we've built CELLxGENE, which is kind of a database of single-cell transcriptomics. And so we're really looking to kind of build on that. And, you know, I don't know how many cells there are in kind of the largest efforts.

We're probably around like a billion cells or something like that today, so we've got to go, you know, multiple orders of magnitude from that. So that involves scaling the technologies that we have now, but it also involves the next generation of technology. So we're also funding and supporting efforts in that area.

And there, you know, we really want to look more at cross-modality. You know, can you simultaneously see the phenotype, observe the transcriptional layer, understand what's happening proteomically, link that to the genome and the epigenetic state? You know, we'd like to be able to see all of that, and so really pushing technology to be developed faster that can reveal more of those connections and more of that biology and do that in a more scalable way.

Brandon Anderson1:01:59

It's interesting because when I hear most of those ideas, they're oftentimes the things that people already think about in terms of scaling biology. What is the next technology that is going to allow for, like, enabling data collection? Um, going back to the theme of the Bitter Lesson for biology, you don't have just scaling laws on, you know, compute and parameters, but now the scaling law is probably in data collection in some meaningful sense.

Where are the next big opportunities there? So you're talking about developing new technology as part of this initiative.

Alex Rives1:02:37

Yeah.

Brandon Anderson1:02:37

Yeah.

Alex Rives1:02:37

So, I mean, I think it's basically the same things that I'm saying: scaling what we have now-

Brandon Anderson1:02:41

Yeah

Alex Rives1:02:41

... kind of being able to expand the number of interventions that, that we can-

Brandon Anderson1:02:46

Yeah

Alex Rives1:02:46

... look at, expand the number of parameters that we can measure-

Brandon Anderson1:02:49

Mm-hmm

Alex Rives1:02:49

... so really kind of more and more multidimensional measurement, um, and, you know, drive down the cost and all of that. So better gene sequencing, kind of better ways of encapsulating cells and being able to measure what's happening, not just the transcriptome, but other layers simultaneously.

Brandon Anderson1:03:05

There's an interesting Pareto frontier there: if you have a fixed budget, how much time do you spend on improving your assay versus how much do you spend on actually scaling it? You know, where do you, uh, net out there?

Alex Rives1:03:16

Yeah.

Brandon Anderson1:03:16

Like-

Alex Rives1:03:16

We, we have to do both of those things-

Brandon Anderson1:03:18

Yeah

Alex Rives1:03:18

... right?

Brandon Anderson1:03:18

Yeah.

Alex Rives1:03:18

'Cause I think with current technology, you know, we can definitely get data at 10X to 100X where it is today-

Alex Rives1:03:25

... with, like, relatively reasonable investments, you know. But then to get another 10X or more here-

Brandon Anderson1:03:30

Mm-hmm

Alex Rives1:03:30

... that's gonna require, require a lot more technology development.

Brandon Anderson1:03:34

Yeah.

Alex Rives1:03:34

But the other, the other really big principle is, is going to be feedback.

Brandon Anderson1:03:37

Mm-hmm.

Alex Rives1:03:37

And so I think that's gonna be really critical.

Brandon Anderson1:03:40

Mm-hmm.

Alex Rives1:03:40

And I think you can see that as a layer of technology development that's gonna need to occur, and I think there's a lot of great things happening right now in kind of automation, flexible robotics, that's gonna accelerate-

Brandon Anderson1:03:52

Mm-hmm

Alex Rives1:03:52

... um, where, where that can go.

RJ Haneke1:03:54

And the experimental design as well.

Bottlenecks1:03:54

Alex Rives1:03:56

Yeah.

RJ Haneke1:03:56

So we typically ask our guests what is a bottleneck that you would remove that would, you know, sort of unlock things, but we just spent a long time talking about that.

Alex Rives1:04:08

Yeah, I think I answered that question.

RJ Haneke1:04:09

So, so-

Alex Rives1:04:09

Yeah

RJ Haneke1:04:09

... so, um, I wanna ask you about it, but I'm gonna give it a spin, which is, maybe a little bit outside of your domain, like language modeling or supply chain, something that is a bottleneck that is maybe non-obvious and not directly something that you are working on, but that maybe has impact on the work of biology or BioHub in particular.

Alex Rives1:04:32

I mean, it's a hard question to answer 'cause there's just so many bottlenecks. I mean, the one that, you know, I always think about is compute, but I think that's a pretty obvious one. It's the bottleneck for all of AI in many ways right now, but, you know, especially because we're training these large-scale models.

You know, we're always focused on compute, and I think we're kind of limited both by the data and compute. I think we're in a position where we have incredible compute resources for a team working in biology. But I think, like all teams working in AI right now, really the limit is just how much compute power.

RJ Haneke1:05:07

Mm-hmm. So if you could 100X your compute, you think that ESM-C would, like, be way better?

Alex Rives1:05:14

I mean, it would definitely be way better. We also need to scale data, so both of those things would-

RJ Haneke1:05:20

Yeah

Alex Rives1:05:20

... have to happen in tandem.

Brandon Anderson1:05:21

Have you basically exhausted what's available right now for-

Alex Rives1:05:24

I don't think so, no.

Brandon Anderson1:05:25

Okay.

Alex Rives1:05:26

No, I don't think so.

Brandon Anderson1:05:27

Okay. The large data sets out there, or you could-

Alex Rives1:05:30

Well, there's more-

Brandon Anderson1:05:30

... relatively, I mean, inf- compute is cheap, right?

Alex Rives1:05:32

More parameters.

Brandon Anderson1:05:33

Yeah. Yeah.

Alex Rives1:05:33

So we trained ESM-C up to six billion parameters.

Brandon Anderson1:05:37

Yeah. Oh, but I'm saying in terms of data available, like, have you exhausted most of what's publicly available in terms of, like-

Brandon Anderson1:05:44

... genomic?

Alex Rives1:05:44

No, not, not yet. And then, you know, the atlas-

Brandon Anderson1:05:47

Mm-hmm

Alex Rives1:05:47

... that we just built actually has more sequences and structures-

Brandon Anderson1:05:50

Yeah

Alex Rives1:05:50

... than ESM-C was trained on. Yeah.

Brandon Anderson1:05:53

So definitely have a little room to go.

RJ Haneke1:05:54

So is that an order-of-magnitude jump or twice as much? Like how does that work?

Alex Rives1:06:00

Yeah. I mean, I think ESM-C is trained on, you know, say, order of a billion sequences.

RJ Haneke1:06:05

Mm-hmm. Mm-hmm.

Alex Rives1:06:05

So there's definitely probably order of 100 billion sequences.

Brandon Anderson1:06:10

A lot of them are largely redundant.

RJ Haneke1:06:13

100 billion?

Alex Rives1:06:15

Yeah.

Brandon Anderson1:06:15

Yeah.

RJ Haneke1:06:15

Oh, okay. To get that billion, you whittled down from six billion, 6.8 billion, right? So of those 100 billion, if you were to similarly cluster and find unique ones, what do you think-- where do you think you would land?

Alex Rives1:06:28

The sequences aren't actually redundant, right? It really depends on what you mean by redundancy.

RJ Haneke1:06:32

Mm-hmm.

Alex Rives1:06:33

Because there's, I think, a tremendous amount that you can learn from small genetic variations, right?

RJ Haneke1:06:39

Mm-hmm.

Alex Rives1:06:39

'Cause these are really revealing of, you know, kind of the very basic determinants of protein structure and function at a very fine level. So I think that, as we think about protein space, having a vast diversity of sequences across a wide range of protein families is really critical for the emergence of this kind of structure prediction capability, because I think large diversity is what trains the model to develop a representation of structure.

But I actually think that to develop a representation of function, it's these very small variations that are important.
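
To make the redundancy point concrete, here is a minimal sketch of identity-based clustering. It is an illustration only, not the pipeline used for ESM-C: the toy sequences, the 0.9 threshold, and the helper names `identity` and `greedy_cluster` are all hypothetical, and real pipelines use alignment-based tools rather than position-by-position comparison of equal-length strings.

```python
def identity(a, b):
    """Fraction of matching positions; assumes equal-length, pre-aligned sequences."""
    if len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs, threshold):
    """Assign each sequence to the first representative it matches at >= threshold."""
    clusters = []  # list of (representative, members) pairs
    for s in seqs:
        for rep, members in clusters:
            if identity(s, rep) >= threshold:
                members.append(s)
                break
        else:
            # No existing representative is close enough: start a new cluster.
            clusters.append((s, [s]))
    return clusters


# Toy sequences, invented for illustration (not real proteins).
seqs = [
    "MKTAYIAKQR",  # hypothetical wild type
    "MKTAYIAKQK",  # single substitution at the last position
    "MKTAYIAEQR",  # single substitution at position 8
    "GSHMLEDPVA",  # unrelated toy sequence
]

clusters = greedy_cluster(seqs, threshold=0.9)
representatives = [rep for rep, _ in clusters]
print(representatives)
```

Training only on `representatives` would drop both single-substitution variants, which is exactly the fine-grained variation described above as informative about function.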

RJ Haneke1:07:16

Mm-hmm.

Alex Rives1:07:16

And so I do think that there is probably a lot more. You know, it's like the models haven't yet been trained at that level of kind of just, like, really deep understanding of these very small but critical patterns in sequence. I mean, a single-

RJ Haneke1:07:31

Right

Alex Rives1:07:31

... a single mutation is enough to destroy the-

RJ Haneke1:07:33

Mm-hmm

Alex Rives1:07:33

... the function of a protein.

RJ Haneke1:07:34

So you could, you could conceivably actually take all 6.8 billion of those, retrain, everything's the same, but-

Alex Rives1:07:40

Yeah. Yeah

RJ Haneke1:07:41

... you do atlas.

Alex Rives1:07:41

No, you could train on more than that. I mean, that's kinda clustered down, so.

RJ Haneke1:07:46

Yeah.

Brandon Anderson1:07:46

Maybe the question is, when do you hit the law of diminishing returns here? I mean, it sounds like you have plans for an ESM4 or an ESM-D or whatever you wanna call it.

Alex Rives1:07:56

We're always developing the next thing.

RJ Haneke1:07:57

Yeah.

Brandon Anderson1:07:57

Yeah. So yeah, but I'm just wondering, you know, at some point, is this actually something that you could exhaust? You know, people talk about exhausting the pre-training data on the internet or something.

Alex Rives1:08:07

Yeah, I mean, at, at some point, yeah.

Brandon Anderson1:08:09

Yeah, yeah.

Alex Rives1:08:09

At some point.

Brandon Anderson1:08:10

Yeah. I mean, it sounds like actually something you could conceivably imagine doing in the next few years. Or even if you don't exhaust it, you hit a lot of diminishing returns for, you know, the applications that you're trying to predict here, where maybe your resources are better spent somewhere else.

Alex Rives1:08:26

I mean, it's, it's basically, it's an empirical question, right?

Brandon Anderson1:08:30

Yeah. Mm-hmm.

Alex Rives1:08:31

It's truly an empirical question, and so we, we just, we just don't know, you know?

Brandon Anderson1:08:35

It's-

Alex Rives1:08:35

I mean, with ESM2, we weren't sure-

Brandon Anderson1:08:37

Uh-huh

Alex Rives1:08:37

... 'cause there, there were some diminishing returns.

Brandon Anderson1:08:39

Mm-hmm.

Alex Rives1:08:39

With ESM-C, you know, now there aren't, right?

Brandon Anderson1:08:42

Yeah.

Alex Rives1:08:42

So you can kind of look at, look at that, extrapolate from-

Brandon Anderson1:08:45

Mm-hmm

Alex Rives1:08:45

... from the scaling law there, and, you know, there is enough data to train that next model, so.
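
As an illustration of what extrapolating from a scaling law can look like, here is a minimal sketch, assuming loss follows a power law in parameter count. The data points are synthetic (generated from made-up constants, not ESM results), and the names `fit_power_law` and `predict_loss` are hypothetical.

```python
import math


def fit_power_law(sizes, losses):
    """Least-squares fit of loss = a * size**(-b), done as a line in log-log space."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)


def predict_loss(a, b, size):
    """Evaluate the fitted power law at a new model size."""
    return a * size ** (-b)


# Synthetic points drawn from a known power law (assumed constants, NOT real measurements).
true_a, true_b = 10.0, 0.1
sizes = [3e8, 6e8, 3e9, 6e9]  # hypothetical parameter counts
losses = [true_a * s ** (-true_b) for s in sizes]

a, b = fit_power_law(sizes, losses)
next_size = 6e10  # hypothetical next model, 10x the largest point
predicted = predict_loss(a, b, next_size)
print(f"fitted exponent b={b:.3f}, predicted loss at {next_size:.0e} params: {predicted:.3f}")
```

With real measurements the points would not sit exactly on a line in log-log space, and whether the trend continues beyond the largest trained model remains, as said above, an empirical question.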

RJ Haneke1:08:52

And the other question that we usually ask is any call to action, or what do you want people to go take action on? If the listeners want to get involved, get hired, build things, what would you ask people to do?

Outro1:08:52

Alex Rives1:09:10

Well, we just announced, or I should say, at the time that this podcast comes out, we will have announced ESM-C and this world model for protein biology. Um, it's gonna be open source. It's gonna be MIT licensed, and we want people to use it. You know, we want this to be a tool that can unlock science.

We're excited to collaborate. We have a team that works on that, and we want to hear from people and understand, you know, what we can build that can help to accelerate their science.

RJ Haneke1:09:42

Awesome.

Brandon Anderson1:09:42

Yeah, we might have a, um, a demo/paper club of some sort on this channel, so stay tuned.

RJ Haneke1:09:48

Yeah, stay tuned for that.

Brandon Anderson1:09:49

Uh, yeah.

RJ Haneke1:09:49

We'll invite you and your team, um, whoever can make it. We'll feature this paper once it's in final preprint form and spend some time on it for an hour on the Latent Space paper club.

Brandon Anderson1:10:02

Yeah. Uh, thanks for chatting with us.

Alex Rives1:10:03

Awesome. Yeah, great to meet you guys.
