Intro 0:00
So ESM-C is also approaching programmable biology, but I would say in a very different way. It's approaching it from this kind of world modeling perspective, where the idea is basically you have a predictive model and you're gonna search the world model to find protein molecules that satisfy whatever design criteria you have.
So we've been able to use this to actually now go and design many protein binders.
Right. Mm-hmm.
But I think sort of most excitingly, we've been able to use this to actually design antibodies, scFvs.
Hello, welcome to the Latent Space AI for Science podcast. I'm RJ Haneki, CTO of Mira Omics.
Yeah, and I'm Brandon. Today, it's a pleasure to have Alex Rives, head of science at BioHub. Would you like to introduce yourself real quick?
Yeah, yeah. Thank you for having me here. It's great to be here. I'm head of science at BioHub. I'm a computer scientist, and I work on AI for biology, and a lot of my work has been on language models for biology.
By the time this podcast is released, you will have put out several new, exciting, interesting models. Going over them, I couldn't help but have the thought that you might be the most bitter lesson-pilled person in protein biology right now. Can you give a little context about what that means for biology and why you're so committed to and excited about this route?
Bitter Lesson 1:02
Well, I'll take that. I believe in scaling laws.
Yeah.
So I guess I've been working on this since the summer of 2018. My team, when we were at Meta FAIR, trained really the first transformer language model for protein biology. And I've always thought that there would be an emergence of biological information as you train a model to predict the next token that evolution creates.
So our team has really explored that idea over a number of years, and I think we've seen the scaling curve: as we have increased models by an order of magnitude in each generation, there's this emergence of new capabilities.
Yeah. So you say emergence of capabilities, scaling over generations. You've been working at this, as you said, for, I guess it would be eight years now, something like that? It didn't always work that way, right? There were signs that scaling might work. We'll be getting to some new results where I think you've really clearly demonstrated this hypothesis in a way that hasn't happened before.
But you seem to have a strong commitment to this, in a way that I'm not sure I would have been so convinced it would work. I mean, protein language is not the same thing as natural language. There are similarities. But if you sample a normal language transformer at infinite temperature, you're gonna get gibberish.
You sample a protein language model at infinite temperature, you're gonna get something which is a valid protein, if not an interesting one. Since it is a different domain, I'm not sure that I would a priori assume the natural language model insight would transfer over. So what specifically about proteins did you think was special, or would make this also valid?
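A minimal sketch of the temperature point above, assuming a made-up logit vector rather than any real model's output: dividing the logits by a temperature before the softmax flattens the distribution, and as the temperature grows every token becomes equally likely. For a protein model, that limit is a uniformly random string of amino acids.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Return softmax(logits / temperature) as a list of probabilities."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for four tokens; not taken from any trained model.
toy_logits = [4.0, 2.0, 0.5, -1.0]

sharp = softmax_with_temperature(toy_logits, 0.5)  # low temperature: peaked
flat = softmax_with_temperature(toy_logits, 1e6)   # very high temperature: near uniform
```

Sampling from `flat` ignores everything the model learned, which is the "infinite temperature" case discussed in the conversation.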
Yeah, I mean, it's a really interesting question, I think, kind of a deep question across AI right now-
Mm-hmm
... more broadly. And I think what's so interesting is AI right now is such an empirical science. We don't have theory that can always guide us in these things, but we have this really strong empirical evidence of scaling. The thing that I was motivated by is, if you think about evolution and the data that we have around proteins, we have databases that have billions of protein sequences.
And those sequences contain patterns. It had long been known, going back decades before we started working on this with language models, that there are patterns in the sequences of protein families that are there because of the constraints that evolution is operating under.
So you can think about a protein sequence that folds into a three-dimensional structure in space, and you can imagine that there are two residues, or amino acids, in this sequence that might be in contact in that folded structure. And so evolution isn't free to choose those independently from each other.
If it makes a choice at one position, it has to make another choice that's gonna be compatible at the other position. So going back all the way to the beginning of gene sequencing, when people first began to be able to look at the same protein in related organisms, you could start to see these kinds of patterns that reflect the fundamental underlying biology.
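One classical way to see the pattern described above is to measure how strongly two alignment columns co-vary. The sketch below computes mutual information between columns of a tiny, invented alignment; the sequences and the function name are illustrative assumptions, not data or code from the ESM work.

```python
import math
from collections import Counter

def column_mutual_information(alignment, i, j):
    """Mutual information (in bits) between columns i and j of an alignment."""
    n = len(alignment)
    pairs = Counter((seq[i], seq[j]) for seq in alignment)
    left = Counter(seq[i] for seq in alignment)
    right = Counter(seq[j] for seq in alignment)
    mi = 0.0
    for (a, b), count in pairs.items():
        p_ab = count / n
        mi += p_ab * math.log2(p_ab / ((left[a] / n) * (right[b] / n)))
    return mi

# Invented toy alignment. Columns 1 and 4 co-vary (K pairs with E, E with K),
# as two residues in contact might; columns 1 and 3 vary independently.
toy_alignment = [
    "AKGLEV",
    "AEGLKV",
    "AKGIEV",
    "AEGIKV",
]

coupled = column_mutual_information(toy_alignment, 1, 4)
independent = column_mutual_information(toy_alignment, 1, 3)
```

Here the coupled pair carries one full bit of shared information and the independent pair carries none, which is the kind of signal a sequence model can pick up on.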
So the idea behind ESM, kind of the thinking behind ESM was, okay, what if you were to apply this principle kind of across all of evolution, across kind of the vast diversity of proteins that have been generated across all of life and, you know, basically have a language model kind of predict the amino acids that evolution will choose to place in proteins across all of those biological contexts.
So you can think that there's just this incredible amount of information in that total picture about the underlying biology of proteins. And so that was really the idea that sparked this: as a model is having to predict the next token (and actually we train these models with masked language modeling, so they're predicting tokens that are masked out of various parts of the sequence), it would have to learn something about those underlying constraints that are shaping which tokens evolution can choose.
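The masked language modeling setup mentioned above can be sketched as a data-preparation step: hide some positions and keep the true residues as prediction targets. The 15% mask rate, the `<mask>` token string, and the example sequence are assumptions for illustration; the conversation does not say which values the ESM models use.

```python
import random

MASK = "<mask>"  # placeholder token name, chosen for this sketch

def mask_sequence(sequence, mask_fraction=0.15, seed=0):
    """Hide a random subset of positions.

    Returns (tokens with masks applied, {position: true residue}).
    """
    rng = random.Random(seed)
    n_masked = max(1, round(len(sequence) * mask_fraction))
    positions = sorted(rng.sample(range(len(sequence)), n_masked))
    tokens = list(sequence)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]
        tokens[pos] = MASK
    return tokens, targets

example = "MKTAYIAKQRQISFVKSHFSRQ"  # arbitrary example sequence
tokens, targets = mask_sequence(example)
```

A model trained this way sees `tokens` and is scored on how well it recovers `targets`, so it has to use the unmasked context on both sides of each hidden position.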
Model History 5:59
Yeah. So maybe for a bit of history: you just released Evolutionary Scale Modeling Cambrian, right? Is that what it's called? Yeah. And this is maybe the fourth or fifth in a series of models, I think maybe even more if you go back before they were called ESM.
No, they were called ESM from the start.
Okay, they were called ESM from the start.
Yeah. We had sort of various branches-
Okay. Yeah, yeah
... of, of the different models.
Yeah, yeah.
Yeah. So this one I would say is kind of a fourth generation model. It's actually a model that we trained a little over a year ago. Now that we're at BioHub, we're open sourcing this model fully under an MIT license for the first time. So we're really excited to do that.
But the big thing that is new here is that we've really built a world model of protein biology. The foundation of that is ESMC, but using the representations of ESMC, we've now built a structure prediction model, and this is the next generation ESMFold model. And then we've also used the techniques of mechanistic interpretability and sparse coding to really start to look deeply into the representation space of the language model and be able to pull out the underlying features that the model actually uses to represent protein biology.
So bringing all of this together, we're able to make predictions for protein structure, and predictions about the underlying features that proteins are made out of, which allows us to build linkages across evolution. We're able to take this model and invert it to design proteins, and we've used this to create a comprehensive picture of protein biology.
So we put together all the world's largest protein sequence databases, and that amounts to 6.8 billion non-redundant proteins, and then we've resolved predicted structures for 1.1 billion of those. And we've also computed features across all of those so that we can make these linkages basically all across evolution and protein biology.
Right. 6.8 billion, of which you've resolved structures for 1.2, is that right?
1.1.
1.1.
Yeah.
So what about the others?
Well, so basically what we did is we took that database and we clustered it at 70% sequence identity. So it's really resolving structures for everything, in the sense that for each cluster we have a cluster center, we're predicting the structure there, and then we can expect that the other proteins are gonna have a similar template structure.
I see.
There'll be small variations-
So they, they're-
... but they have the same fold
... 1.2 billion or so clusters.
That are kind of covering the 6.8-
Yeah
... billion. Yeah.
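The idea of clustering at 70% identity and keeping one representative per cluster can be sketched with a greedy pass. This is a toy under stated assumptions: identity is computed position by position on equal-length strings, whereas real pipelines align sequences first, and the conversation does not say which clustering tool was used.

```python
def sequence_identity(a, b):
    """Fraction of matching positions; assumes pre-aligned, equal-length sequences."""
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)

def greedy_cluster(sequences, threshold=0.7):
    """Assign each sequence to the first representative it matches at >= threshold."""
    clusters = {}  # representative sequence -> list of member sequences
    for seq in sequences:
        for rep in clusters:
            if sequence_identity(seq, rep) >= threshold:
                clusters[rep].append(seq)
                break
        else:
            clusters[seq] = [seq]  # no match: this sequence starts a new cluster
    return clusters

# Invented sequences: three close variants and one unrelated sequence.
toy = ["MKTAYIAKQR", "MKTAYIAKQK", "MKTAYLAKQR", "GSHMLEDPVA"]
clusters = greedy_cluster(toy)
```

Predicting a structure only for each key of `clusters` mirrors the description above: the other members are expected to share the representative's fold with small variations.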
Okay. Interesting. And since we're talking about scaling, how do you know that this is the right number? How do you know that focusing on these 1.1 billion is the right resolution for this model?
Well, we've chosen them so that they really cover that entire space.
Mm-hmm.
So I think what I can say about this database is it's really the most comprehensive picture of protein structure and function that's been created. It's adding hundreds of millions of structures to our knowledge of the diversity of protein structure, and it's also creating this feature space that allows us to find these linkages between proteins across evolution.
So we can see really interesting themes emerging across evolution, linking, for example, gene editing systems which are very far apart in sequence but share some underlying functional patterns, structural homology that the model's able to bring together to find those connections.
Now we're talking about the mechanistic interpretability part. So, if I understand correctly, you use sparse autoencoders and maybe other techniques to understand: when I activate the network using a protein, what are the patterns of outputs that I'm seeing, and how do they relate to each other? If I understand correctly, you have these sequences that are unrelated or only partly related based on the actual sequence, but they have similar behavior and therefore they are activating similar networks.
Is that kind of the summary of what you just said?
Yeah. So basically what we've done is we've trained sparse autoencoders across all the different layers of the ESMC model family. There are actually three models in that family: a 300 million parameter model, a 600 million parameter model, and a six billion parameter model. And then we've done a very deep analysis of the feature space of that six billion parameter model, which is really the state-of-the-art protein language model.
And what we find, what's really interesting, is there's this hierarchy of features that emerges. It really corresponds to the reductive picture of biology that has been developed over many decades, a century, of biological experiments.
But what's so cool is this is emerging without any prior knowledge. It's been learned by the language model. The interesting thing about SAEs, right, is they're really just revealing the intrinsic structure of the representation space. So this model's been trained on protein sequences, it's been trained just to predict the amino acids that evolution will choose, and then somehow this is leading to the emergence of this very ordered feature space that has this hierarchical structure, where you can really see everything from the basic biochemical properties and the basic structural building blocks of proteins to these very large functional themes, these abstract concepts that connect to the human picture of protein function.
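For readers unfamiliar with sparse autoencoders, here is a minimal forward pass in plain Python. It assumes the common basic form (a ReLU encoder, a linear decoder, and an L1 penalty on the features); the conversation does not specify which SAE variant was trained, and the sizes, weights, and input vector below are random stand-ins rather than anything from ESMC.

```python
import random

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def sae_forward(x, w_enc, b_enc, w_dec, b_dec, l1_coeff=1e-3):
    """One forward pass: encode to non-negative features, decode, score."""
    features = relu([z + b for z, b in zip(matvec(w_enc, x), b_enc)])
    recon = [r + b for r, b in zip(matvec(w_dec, features), b_dec)]
    mse = sum((a - b) ** 2 for a, b in zip(x, recon)) / len(x)
    loss = mse + l1_coeff * sum(features)  # L1 term pushes most features to zero
    return features, recon, loss

rng = random.Random(0)
d_model, d_features = 4, 16  # hypothetical sizes; the feature dictionary is wider than the input
w_enc = [[rng.gauss(0, 0.5) for _ in range(d_model)] for _ in range(d_features)]
w_dec = [[rng.gauss(0, 0.5) for _ in range(d_features)] for _ in range(d_model)]
b_enc = [0.0] * d_features
b_dec = [0.0] * d_model

embedding = [0.3, -1.2, 0.8, 0.1]  # stand-in for one residue's hidden state
features, recon, loss = sae_forward(embedding, w_enc, b_enc, w_dec, b_dec)
```

After training, each entry of `features` is a candidate interpretable feature; the single "nucleophilic elbow" feature discussed later in the conversation would be one such entry that fires across unrelated protein families.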
And do you have a hypothesis or a feel for why-
If there are relationships between the sequences themselves, even if they're shifted and cut up and recombined in different ways, I can imagine that might work, because these proteins are kind of hierarchical in their nature as well. So maybe the hierarchy moves around, but they're the same sequence. But the functional units, I guess, those have related structures.
What is the hypothesis here?
I mean, it's a really interesting question, right? And I think I can speculate about it.
Sure.
I don't think we completely understand this, right? But let me give a concrete example. So the nucleophilic elbow is this core functional motif that people have thought maybe has emerged independently-
Mm-hmm
... in evolution, at different times and in different protein families. But it has this very clear structural motif that you can see in a crystal structure. What we found basically is that the model has a single feature for this nucleophilic elbow, and it's activating across these very evolutionarily diverse families.
Really completely different structural topologies, proteins that probably evolved entirely independently from each other. But the model is using this one feature to represent that. So why does it do that? I think it's a really interesting question. I think one answer is the idea of compression, the idea that the model needs to have some kind of underlying latent variables that it develops to help solve this sequence prediction task.
Because the nucleophilic elbow, it's gonna be a function of... What's so interesting, right, is the choice of any amino acid is completely entangled with the choice of all the other amino acids in the sequence. So this is a very complex task, to try to predict what amino acids should be where in a protein.
But to really do this well, the model would have to start to have these hidden variables representing the biology that allow it to look at a protein and say, "Okay, what amino acids should be there in all these different contexts?" So that's, I think, the intuition.
I mean, I would draw the parallel to language modeling, right? I was very influenced by a paper by Zellig Harris. It was called Distributional Structure, from 1954, and I think it's a paper that influenced a lot of people in the language modeling field as well.
It focuses on language, and it really articulates this idea that the set of contexts in which a word appears is determined by the meaning of that word. And so what Zellig Harris imagined is that, as you looked at the statistical patterns of what words appear in what context sets, you would be able to derive the meaning of language.
You know, you would, you would have this kind of statistical structure that would mirror the underlying meaning of language. For me at least, that's one of the most convincing explanations for why, you know, a language model that's trained on the text of the internet is going to learn something about meaning. It's gonna learn something deeper and more fundamental.
And so I think you can think about the same thing in biology, where the contexts in which an amino acid can occur are really determined by the structure or the function of the protein, its biological roles. These are very complex phenomena, both the intrinsic biology of the protein and its relation to all of the other proteins and the function and evolution.
But those are what determine the context sets. And so you would imagine that those statistical patterns in the use of amino acids directly reflect those underlying hidden variables. And so the model is gonna learn something about those hidden variables.
I definitely buy that; it seems plausible. In fact, I'm gonna be clear: I actually do really believe in this direction. But there are a lot of ways I think about this where I would imagine maybe it wouldn't work, and one of them is data availability, right?
Data & Scale 16:03
Like, what type of data do we normally, uh, have? What type of sequence data do we normally get? I think that ESMC in particular has some new data sources compared to, like, previous models, which might be helpful. But oftentimes, the type of sequences we have available have, like, a very strong bias towards certain specific needs for medicine or, you know, human biology or disease biology, right?
So it's not necessarily that if you take just a naive data set, you're gonna get an interesting scaling law. So I'm curious what in particular was the breakthrough in ESMC. Maybe we can go back a bit and talk about some of the ESM predecessors, which got here before ESMC: their strengths, but also maybe some of the limitations that ESMC overcame, and what the developments there were.
Yeah.
Sure.
Well, I'll admit that I am bitter lesson-pilled.
Yeah.
I am scaling-pilled. And so I do think that just increasing the data-
Mm-hmm
... increasing the parameters and having that compression is going to just lead to more powerful models. But it is also true, and I think you're absolutely right, that the underlying structure and distribution of the data is really critical. And so some data sets will be far more valuable for learning these general principles than others.
But I think it goes against a lot of biological intuitions about collecting data, I guess is what I'd say, because normally when you think about what data you want, you're trying to answer a very specific scientific hypothesis. You want a very well-controlled experiment. You really want multiple replicates. It's something that's very focused, is the way that I would put it.
So I think the change in the way of thinking is to think, okay, what you really want, if you want to learn a general representation of proteins, is to see amino acids in as many evolutionary contexts as possible. That's really what you want. That's really how I think about data. And what changed between ESM2, which was the previous generation model, and ESMC, which is this new generation model, 'cause they're both at approximately the same scale, and, you know-
Sorry, the same scale, uh, of compute-
Same scale-
Or same scale of parameters
... in scale and parameters.
Okay.
Yeah. ESM2 got a lot of compute, but-
Mm-hmm
... ESMC got even more compute.
More, yeah.
But it's not just the compute. The data was really the critical thing here, actually. So when we trained ESM2, we observed two things. The first was that as we increased the number of parameters and compute, we saw improvement. So we had a model at the billion parameter scale, we had a model at the 10 billion parameter scale, and the larger scale model is better than the smaller scale model.
But if you look at a plot of parameter scale, a log plot of parameter scale versus capability (and for capability, we're looking at the representational fidelity: how well does it capture protein structure?), you could see that there are diminishing returns in ESM2. ESM2 is trained on UniRef.
And for ESMC, we added metagenomics. So we added billions more sequences-
So-
... to the training data.
Yeah. Could you explain what UniRef and metagenomics mean?
Yeah. Yeah. So UniRef is, I'd say, the gold standard dataset-
Mm-hmm
... of sequence biology. It's taking sequences from across a wide variety of different sequencing resources. It's clustering them to remove some of this redundancy that you were mentioning. And so it creates a definitive coverage of protein biology. What has happened in parallel to the classical gene sequencing is this idea of metagenomic sequencing, where people go out into all kinds of different biomes and environments and collect samples from the world and just sequence the natural diversity that's present there.
So, proteins from a hydrothermal vent, or proteins from a frigid environment near the South Pole, or the deep ocean, or soil, or the human gut. All kinds of different environments.
So this is a very different way of collecting data. Instead of trying to understand a specific genome of a specific organism, or trying to understand a specific protein, you just collect a bunch of stuff, mix it up in a pipe, get the sequences out. You have no idea what organisms these are from. You don't necessarily even know if a given sequence is a protein, but you can guess based upon certain contexts.
And you say, "Okay, we throw these together. These are likely protein sequences we find. We're not assigning them to an organism. We're not assigning them to, like, a larger context. We're just saying, 'This is probably a protein. Let's train on it.'"
That is right. Yeah, and you don't even get the full genomes. You just get these contigs that often are broken and have even partial proteins.
Mm-hmm.
So the data's really noisy.
One more little nerdy question that I have here: if I understand correctly, you're not actually using a device that sequences proteins. You're sequencing the DNA that would manufacture those proteins, so you're finding DNA and then looking for markers that indicate the beginning and end of a protein sequence.
Is that kinda-
Yeah, that-
... correct?
That's exactly right. Yeah.
Okay.
Basically sequencing genetic sequences, and then you translate the proteins from those sequences.
And so you're digging up, like, sewers and-- Not you.
But not me personally.
That's not you. But there are scientists-
But there are somebody who-
Digging in sewers
... who are doing this ... like, like probably-
Yeah
... many thousands of people.
The New York City subway.
Yeah.
You know, all, all kinds of things.
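The step described a few turns earlier, reading DNA and translating the protein between a start marker and a stop marker, can be sketched as follows. It uses the standard genetic code and scans only the forward strand from the first `ATG`; real gene callers also handle both strands, all reading frames, and alternative start codons, so treat this as an illustration. The example DNA strings are invented.

```python
BASES = "TCAG"
# Standard genetic code, listed in TCAG order for each codon position ("*" = stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate_first_orf(dna):
    """Translate from the first ATG until a stop codon.

    Returns (protein, complete); complete is False when the DNA ends before
    a stop codon, as happens with broken metagenomic contigs.
    """
    dna = dna.upper()
    start = dna.find("ATG")
    if start == -1:
        return "", False
    protein = []
    for i in range(start, len(dna) - 2, 3):
        residue = CODON_TABLE.get(dna[i:i + 3], "X")  # X for ambiguous bases such as N
        if residue == "*":
            return "".join(protein), True
        protein.append(residue)
    return "".join(protein), False

full = translate_first_orf("CCATGAAAACCGCGTATTAAGG")  # has a start and a stop
partial = translate_first_orf("ATGAAAAC")             # contig ends mid-gene
```

The `partial` case corresponds to the "partial proteins" mentioned above: the translation is usable as training text even though the sequence is truncated.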
Yeah. So the natural question to me is, so you built this model and you think that you've kind of de-duplicated it so that you have a good representational set without a lot of redundancy in it. How much more is there? Like, if we had an order of magnitude more resources, do you think that there is an order of magnitude more proteins to discover?
I think so. I'm not entirely sure, but there are a lot of proteins, and I think we've barely scratched the surface of measuring Earth's biodiversity. So there are core proteins that are conserved across all of life. I think we know that, right? But as you go into these different environments, there are just new genes and new proteins constantly being created by evolution.
My understanding is a lot of this is viruses and bacteria and other-
Yeah, microorganisms
... microorganisms. Yeah. Eukaryotic organisms. And so these guys are in this basically long-running conflict with each other that causes them to recombine their DNA in ways that help them to survive in these extreme or whatever environments.
Yeah, yeah.
And so that's what's causing this incredible diversity of proteins.
That's right. Yeah. Yeah. And just four billion years of life running experiments in parallel all across the earth in all kinds of different ecological niches, and we see the outcome of all of that.
And so the combinatorial, that's why you believe that, yeah, there's going-- Although maybe from a macroscopic perspective, when we look at it, there's not even nearly as much diversity as there will be at the microscopic scale, because you have this incredible combinatorial effect.
Yeah. I mean, there's just, I think, tremendous diversity there. So kinda going back then-
Yeah
... to ESMC-
Sorry, sort of nerd snipe there, but
... so, yeah, no, it's great, right? And I think it's really-- I mean, we could also talk about data and building models of the cell and really going from the molecular level to-
Yeah
... you know, to higher levels of biological complexity. But to complete the description of ESMC, that's really the big change, adding these metagenomic sequences. And then what we saw basically is there are no longer diminishing returns to scale. So that's really saying that ESM2 was data limited rather than compute limited.
For ESMC, there's a really beautiful scaling law that we can plot. To make the larger models, we basically train models at the smaller scale, and we can look at the best representational fidelity that they can achieve for a given compute budget and just draw a line of extrapolation out that beautifully predicts what the larger scale models will be able to achieve in their representational fidelity.
So there's this really beautiful scaling. I mean, there are some changes to ESMC just to make it a more efficient model for training. But I think the data is really the big thing there that's driving that.
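The "draw a line of extrapolation" idea can be sketched as a least-squares fit in log-log space. Everything numeric here is synthetic: the compute budgets, the error values, and the exponent are made up to show the mechanics and are not ESMC measurements. The metric is treated as an error that falls with compute, which is one assumed way to express representational fidelity.

```python
import math

def fit_power_law(compute, metric):
    """Least-squares line through (log10 compute, log10 metric).

    Returns (slope, intercept) of the fitted line.
    """
    xs = [math.log10(c) for c in compute]
    ys = [math.log10(m) for m in metric]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(compute, slope, intercept):
    """Evaluate the fitted power law at a new compute budget."""
    return 10 ** (intercept + slope * math.log10(compute))

# Synthetic small-scale runs following error = 5 * compute^-0.1 (invented numbers).
small_runs = [1e18, 1e19, 1e20]
errors = [5.0 * c ** -0.1 for c in small_runs]

slope, intercept = fit_power_law(small_runs, errors)
extrapolated = predict(1e22, slope, intercept)  # forecast for a much larger run
```

If a large run lands on the extrapolated line, returns to scale are not diminishing; a curve that bends away from the line is the data-limited behavior described for ESM2.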
So it still is basically just a standard vanilla transformer, a few tricks. Everyone has-
It is
... a few tricks at this point.
Yeah. That's-
Masked language model and just a lot of data. Yeah.
So I mean, this is very much in contrast to something like AlphaFold, right?
Yeah.
Where you have a lot of inductive bias-
Mm-hmm
... built into the model in order to be able to predict protein structure.
That's right, and the idea here is really, can we just learn the right structure? You know?
Mm-hmm.
Don't, don't give any priors-
Mm-hmm
... just allow, you know, allow machine learning to figure out-
Mm
... what that structure is.
So you also had your own detour into priors with ESM3, right? Like, or maybe not priors, but, uh, using more intuition or more human design. Do you think ESM3 was a detour? Do you think there was... I mean, did you just end up saying like, "Okay, let's make C bigger," and then suddenly it worked and now you learned that actually we don't need priors anymore?
Programmability 25:49
Is that like kind of a key insight, or do you still think there's room for priors?
I think we need both. I think there's a place for both of them. So the goal for ESM3 was to really make biology programmable, and so we're trying to think-
Mm-hmm
... okay, what is the programming language, right? How are you going to be able to allow biologists to prompt a model and-
Mm-hmm
... design structure and design function and all these things? And so we really thought it needed the right tracks.
Mm-hmm.
But I would say that ESM3 was very consistent with the philosophy of ESM, because what we did is we basically predicted structures for this vast array of evolutionarily diverse proteins, and we're using that as the training data. So the model's now learning from sequence patterns, learning from structural patterns, learning from functional patterns.
Mm-hmm.
But I think that same kind of synthesis that the model is learning on sequences, you could imagine that bringing in more multidimensional information would build an even better representation space.
If you are a coder, or if you're building language models and then building coding agents, you start with pre-training on everything, and then you get to the programming part by some sort of post-training, probably RL. Have you thought about post-training ESMC to try to give you the same abilities for programmability? Do you think you could get programmability without all of the inductive biases, which involve an atlas of structures and, you know, some sort of interesting distillation? Although I guess maybe that is in some sense kind of post-training of a different model.
Yeah.
Yeah, I mean, I think it's a really interesting question, to what degree can you interconvert these-
Mm-hmm
... models? I don't think kind of that's fully understood yet.
Mm-hmm.
But I think it's a very promising direction to think about-
Yeah
... doing that and what are the right ways to do that. So ESMC is also approaching programmable biology, but I would say in a very different way. It's approaching it from this kind of world modeling perspective-
Mm-hmm
... where the idea is basically you have a predictive model and you're gonna search the world model to find protein molecules that satisfy whatever-
Mm
... design criteria that you have. So we've been able to use this to actually now go and design many protein binders.
Right. Mm-hmm.
But I think sort of most excitingly-
Mm
... we've been able to use this to actually design antibodies, scFvs-
Mm. Mm
... and we're seeing really, I think, exciting-
Mm-hmm
... success rates in a-
Mm
... small number of trials now.
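"Searching a predictive model" for sequences that meet a design criterion can be illustrated with a generic hill climb. This is not the procedure used with ESMC, which the conversation does not describe; the scoring function below is a deliberately trivial, hypothetical stand-in for a real predictor such as a binding or structure score.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def toy_score(sequence):
    """Hypothetical stand-in for a predictive model: fraction of charged residues."""
    return sum(residue in "DEKR" for residue in sequence) / len(sequence)

def hill_climb(start, score_fn, steps=200, seed=0):
    """Propose single-residue mutations and keep those that do not lower the score."""
    rng = random.Random(seed)
    best, best_score = start, score_fn(start)
    for _ in range(steps):
        pos = rng.randrange(len(best))
        candidate = best[:pos] + rng.choice(AMINO_ACIDS) + best[pos + 1:]
        candidate_score = score_fn(candidate)
        if candidate_score >= best_score:
            best, best_score = candidate, candidate_score
    return best, best_score

start = "GSGSGSGSGS"  # arbitrary starting sequence
designed, score = hill_climb(start, toy_score)
```

The design choice that matters is the separation of concerns: the model only predicts, and the search loop decides which candidates to keep, so the same loop works with any scoring function passed as `score_fn`.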
Yeah. Yeah. So can you explain what those scFvs are?
Yeah, yeah. So an scFv is basically a single chain antibody.
Mm-hmm.
So it's a kind of therapeutic modality that basically has... So an antibody has a heavy chain and a light chain.
Mm-hmm.
And then it basically has a pair: one heavy chain and one light chain, and another heavy chain and one light chain, that come together to recognize a target. So there are different variations of these modalities that are used therapeutically. And what's interesting about the scFv is it has one heavy chain and one light chain, so it is able to form these very complex binding interfaces where you can have two different subunits coming together to engage a target.
These are an important therapeutic modality. I think something like a quarter of new drugs are antibodies, so it's really one of the critical modalities for medicine. And basically what we're able to see is that you can search ESMC and you can actually find antibodies that are reaching the level of affinity, I should say are really at the level-
Mm
... of affinity that is needed for therapeutic function and activity.
The protein design space has kind of exploded in the last five years. You know, everyone is doing protein design, pretty... You know, many people are excited about protein design. Uh, my kind of high-level, naive understanding of the field is that things like mini binders, um, are quite doable. People have done that, you know, quite routinely and successfully.
You know, in smaller... By the time you get to, like, nanobodies and scFvs, they're a little bit harder to design. Um, and then antibodies are still actually quite out of reach oftentimes. One of the common reasons, you know, for this is if you're in the AlphaFold paradigm, you don't have MSAs, right? The evolutionary pressure for antibodies is actually the opposite in many ways of what the evolutionary pressure is for everything else.
They go for diversity rather than trying to, uh, evolve along a very, like, constrained path. So I'm curious, did you try larger structures, and is that something that you've seen success on, or is this something that you still think for some reason it might be hard to do?
Well, you can actually take the, um, scFvs and reformat them-
Yeah. Yeah
... as, as antibodies. So I think that-
Yeah
... would be kind of the quickest approach to do that. Um, we've not tried full IgGs. I don't see-
Mm-hmm
... any reason why that wouldn't work.
Yeah.
Actually, it's something we haven't done yet.
Yeah.
You know, we, we've decided we're basically kind of releasing this now-
Mm-hmm
... because we feel like it's kind of reached a point where, you know, we're seeing, I think, a really significant step above kind of what's been possible in the past.
Mm-hmm.
And so we just, we wanted to get it out there, but-
Yeah
... you know, I, I think there's a lot more progress that's possible. So we're, you know, we have-
Oh, yeah
... collaborations to kind of look at some of the other-
Yeah, yeah
... applications here. You know, the thing about it, right?
Mm-hmm.
Is it, it's a general model. So I, I think to me that's the most exciting thing about it, is just, you know, a general model for protein sequence, structure, and function.
Yeah. Mm-hmm.
You can search it and, you know, therapeutic design basically emerges from that search.
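A minimal sketch of what "searching a predictive model" can look like, assuming nothing about how ESM-C itself is searched. Everything here is hypothetical: `toy_score` is a stand-in for a real model's prediction of how well a sequence meets a design criterion, and `search` is plain hill climbing over single-residue mutations, chosen only because it is the simplest search loop to read.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def toy_score(seq: str) -> float:
    """Hypothetical stand-in for a predictive model.

    Returns the fraction of hydrophobic residues. A real design run would
    call a trained model here; this toy only makes the loop runnable.
    """
    return sum(seq.count(a) for a in "AILMFVW") / len(seq)


def search(score_fn, length: int = 20, steps: int = 500, seed: int = 0):
    """Hill-climb over single-residue mutations, keeping non-worsening moves."""
    rng = random.Random(seed)
    best = "".join(rng.choice(AMINO_ACIDS) for _ in range(length))
    best_score = score_fn(best)
    for _ in range(steps):
        pos = rng.randrange(length)
        candidate = best[:pos] + rng.choice(AMINO_ACIDS) + best[pos + 1:]
        candidate_score = score_fn(candidate)
        if candidate_score >= best_score:  # accept improvements and ties
            best, best_score = candidate, candidate_score
    return best, best_score


if __name__ == "__main__":
    sequence, score = search(toy_score)
    print(sequence, round(score, 3))
```

The point of the sketch is the division of labor: the model only scores candidates, and the design criteria live entirely in the scoring function, so swapping the criterion changes what the search finds without changing the search.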
Yeah. Mm-hmm. Yeah. I mean, to me, you mentioned that you're not using MSAs, multiple sequence alignments, which was one of the, or maybe the, critical insight that allowed AlphaFold to work really well. And the fact that you didn't need that in order to make it work basically as well as AlphaFold 3 is really exciting to me, because that speaks to your thesis of, let's cover the space of possible proteins as well as we can and see what the emergent behaviors are. So if this is an emergent behavior, where we're able to replicate what happens when we have used multiple sequence alignment, what are the other things that maybe we don't have data for, but that we are able to also do in an emergent way?
I would say, actually, you know, we're doing significantly better on antibodies, so I think that's one of the things that's really cool. That's one of the theses that we had: you know, antibodies are not gonna benefit from, um, evolutionary information, probably, in the same way that predicting the structural topology of, uh, a molecule will.
So, you know, I think, I think you kind of see that now where the, the representation space is containing something that's really interesting about antibodies here.
Virtual Cell33:11
I want to talk about something, 'cause you mentioned something very interesting to me, which was the virtual cell and how this work maybe interfaces with it. I'm really interested to know, were you able to find other things in your mechanistic interpretability? What were some interesting things that weren't just validating biology, but where there's a pattern that was unexpected?
Did you find anything like that?
It's complicated, because we have to now actually go and validate some of these things, right?
Sure, yeah.
So I think what we saw are, like, interesting connections, right?
Mm-hmm.
So, um, you know, what we can see, for example, is that distantly evolutionarily related gene editing systems cluster together in this space in ways that are consistent with and reflect our knowledge of the origin of those gene editing systems. So that's really exciting. But the thing is, right, there's a number of proteins in that map that are brought together in different ways where we just don't know what they are right now.
We don't know what they do. So one hypothesis there is, well, these are kind of novel gene editing systems. I think in this atlas, you know, there's, there's gonna be some, some really interesting basis for scientific discovery there. And if you think about kind of how people go out and look for new gene editing systems, for example, they're typically mining the large genetic sequence databases, and they're looking for kind of different sequence patterns or structural patterns that are linked to that.
Actually, the first version of the ESM Atlas was used by Feng Zhang's group to find, um, a new gene editing system. So I think there's just a lot of biology out there that we don't understand that's waiting to be discovered, and it's about being able to connect the dots between proteins so that we can go from, you know, what it is that we know today to make those inferences about the unknown.
So that's what I'm excited about. And I think, you know, there are proteins for so many applications that nature has probably invented. You know, you think about the thermostable polymerase, which enables PCR, which came from a bacterium living in a thermal hot pool. You know, there may be the solution to climate change, you know, somewhere in protein biology.
There are probably all kinds of building blocks for completely green chemistry infrastructure out there. There's probably new medicines and therapies, you know, but the question is, how do you find those? And so I think, you know, being able to connect the dots is really one way to start to open up that space of protein biology to discovery.
I'm curious, uh, one of the advancements of ESM-C is an improvement in multimers, so basically protein-protein interactions, like, um, predicting that structure. The ability to predict the way two proteins interact, I think you now claim to do better than anyone else, right? Correct me if I'm wrong.
Yeah, I mean, I think we're state-of-the-art-
Yeah
... for protein models, yeah.
Okay. One thing which I know some people would find very useful for virtual cell is just an entire mapping of every single pair of proteins inside the human transcriptome. Have you thought about doing this in terms of, um, like kind of a beginning to a virtual cell, like create that map?
So I, I think something like that-
Yeah
... would be really valuable.
Mm-hmm.
I think, you know, it's fast. So the other thing about ESMFold 2 is that it's a really fast model because it doesn't-
Yeah
... require the multiple sequence alignment.
Mm-hmm.
So, you know, you can do inference kind of, you know, directly from the sequence. Um, it takes seconds. You know, you can get an atomic resolution-
Mm
... prediction. So yeah, that's I think one really interesting application. At BioHub, I mean, the other thing that we're thinking about is can we actually experimentally resolve this?
Mm-hmm.
And so one of the things that we are building is cryo-electron tomography, and we're really building systems that can greatly increase the contrast when you're looking at, you know, the cell at the atomic level. And so I think one thing that I hope to see is actually a structurally, empirically resolved interactome at some point in the future.
And I think there are some pretty big technical hurdles, and technologies that have to be developed to overcome them. But I think that's something that's going to be possible. So we can use computational methods to start to get a proxy of that, and I think, you know, that's gonna be really powerful.
But I think a lot of the future of structure prediction is gonna turn into structure determination, actually. You know, really bringing together these kinds of tools that we have for modeling proteins with experimental data so that we can start to, you know, develop this picture that's, uh, you know, informed by empirical biology, by what we can observe.
So is that the vision here, if I'm understanding correctly, that you have maybe a lab-in-the-loop kind of thing, where you have an agent that is talking to your, you know, C7 and whatever, and then it predicts a property that you're interested in. It sequences the genome, or it creates the genome, it creates the protein from the genome.
Then it observes it with some version of this, uh, microscope. What did you call the microscope again?
It's a cryo-electron tomograph.
Okay, okay. And then you do whatever experiments, or you observe it, and then you use this as a lab in the loop to say, "Oh, okay, this folds this way, therefore the next one that I want to check is actually a different one," and use an active learning system. Is that sort of the vision that you're articulating here?
Future Paradigm38:43
Well, I think there are gonna be a few fundamental principles for the next era of biology. And I think, yeah, it's such an interesting time right now because I think we're at the beginning of a new scientific paradigm. It's really just the beginning of it. And so what is, you know, what is defining in that paradigm, right?
And so I think there are a few principles. You know, one is scaled data generation. I think that's gonna be really critical. The second is computational, you know, predictive digital representations of biology, and you can think of ESM as being, you know, first generation, AlphaFold as being a first generation of those kinds of approaches.
And so you can kind of start to think about what does that look like as we can model more and more biological complexity in that way. And then you have the principle of feedback, and you have the principle of, you know, we have intelligence now that's scalable and so can be applied to every unit of a biological problem.
What would it mean for all of that to come together? So I think we're gonna have increasingly capable and accurate digital representations of molecules, genomes, cells, ultimately physiology. That's where you want to get. We're gonna have to go up that complexity scale, the levels of biological complexity. That requires traversing a data barrier.
There's, I think, data that does not exist that needs to be generated to achieve that level of predictive fidelity. And then we're going to have reasoning. And I think, you know, what that will mean is that we can reason over thousands, millions, hundreds of millions of scientific hypotheses in parallel digitally using predictive oracles, which can, you know, actually predict the outcome of an experiment.
So the scale at which we can ask questions and the kinds of questions that we can ask will just fundamentally change through that. Feedback is gonna be critical. You know, there's gonna be sort of a scaling dimension of this, which is, you know, building the data to have those accurate representations, and then a feedback dimension where the models can learn from biology, can reason digitally, can reduce that to a small number of experimental hypotheses, examine the outcome of each of those experiments, update, uh, their understanding, and build knowledge in that way.
So I think that's what it's gonna look like, and we kind of have to build each of those components. What BioHub is really trying to do is to bring together the experimental and the technology layer that will actually allow us to have these AI models interact with the biology and do experiments. And I think it's, you know, it's Amdahl's law.
You know, we see incredible advances in areas where we can get feedback computationally, um, so in closed domains. But of course, you know, experimental biology is completely open-ended, and so the feedback principle there is going to be very different. But, you know, there's gonna be something like RLVR, you know, with experiments, where we can, you know, have models that are just really building knowledge and learning from that knowledge and being able to develop more and more accurate representations.
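For reference, Amdahl's law says that if a fraction p of a workload is sped up by a factor s, the overall speedup is 1 / ((1 - p) + p / s). The sketch below applies it to the point being made here, under the assumption (not stated in the conversation) that the computational steps are the accelerated fraction and the experimental steps are the part that stays at its original speed. The 90% and 1000x figures are purely illustrative, and `amdahl_speedup` is a name introduced for this example.

```python
def amdahl_speedup(accelerated_fraction: float, factor: float) -> float:
    """Overall speedup when only part of a workload gets faster.

    accelerated_fraction: share of total time that is sped up (0 to 1).
    factor: how many times faster that share becomes (> 0).
    """
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if factor <= 0.0:
        raise ValueError("factor must be positive")
    remaining = 1.0 - accelerated_fraction  # the part that stays slow
    return 1.0 / (remaining + accelerated_fraction / factor)


if __name__ == "__main__":
    # Illustrative numbers only: 90% of the time is computation, made 1000x faster.
    print(round(amdahl_speedup(0.9, 1000.0), 2))
```

With these illustrative numbers the overall speedup stays just under 10x, because the untouched 10% sets a ceiling of 1 / (1 - p) no matter how fast the rest becomes.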
You're the head of science at BioHub. Maybe fun fact for those who don't know, the science section of Latent Space was, um, basically launched after, or in response to, Mark Zuckerberg and Priscilla Chan being on this podcast about six months ago. It's actually very exciting to have you here and kinda come full circle, and, uh, Mark laid out quite an ambitious vision for what BioHub wants to accomplish, and I think you just laid out a very natural, you know, successor to that.
BioHub Mission42:16
I think you had just joined, like, you were there, like, two weeks.
I joined-
Finally.
Yeah, yeah.
Yeah.
I think at the very end of October.
Yeah, yeah, yeah.
And launched at the beginning of November.
Yeah, yeah. One thing I'm curious about is, in your eyes, you know, where is BioHub now? Like, what do you want to, you know, accomplish? What are your big picture goals? For listeners who haven't watched, um, the episode with, you know, Mark and Priscilla, um, we recommend, of course, that they go watch it. Link in the description.
And then have you learned anything, even in just the short time of six months you've been here? And, like, has the vision evolved? And, you know, where do you see this going? Um, I think, you know, how does ESM-C fit into this? Uh, how does, you know, the Virtual Biology Initiative that you recently announced fit into this?
And then I think there's, like, several other things that you're working on that we haven't even touched on.
Yeah.
Yeah.
Well, I, I'm learning things every single day.
Yeah.
But the way I think about it, we're building a scientific institution for this new paradigm. And, you know, to do that, it's an institution that's going to be powered by frontier experimental biology-
Mm-hmm
... frontier technology for measurement, for observation, and it's going to be powered by frontier artificial intelligence. You know-
And this is all open source, right?
It's a philanthropy. So-
Mm-hmm
... so our goal is to accelerate science. Our mission is to cure or prevent disease. And so to do that, you know, our belief is that there's a fundamental gap in our understanding, and we need to accelerate science to traverse that gap. And so we're really thinking about every layer of biological understanding, from the most basic level, like, you know, the atoms of a protein in a cell, all the way to systems of cells in physiology and disease, and how can we create models that can capture that complexity and allow us to understand that complexity?
And I think, you know, if you think, what does the cure to disease look like, right? It's not a pill, right? It's not a medicine in the conventional sense. You know, it's going to have to be, um, a system that is capable of modeling and understanding, you know, the underlying physiology of disease in a way that's differentiated for every single human being, for every single different genome, and it's going to have to be able to link events all the way at the molecular scale to the manifestation of disease in physiology.
So it's an incredibly complex, incredibly hard problem. And for us, you know, we're trying to ladder up those layers of complexity, and we're trying to build kind of the foundational tools that scientists can use to, you know, answer the fundamental questions there. And so we're creating atomic-level imaging. We're creating light-sheet microscopy that allows us to observe, you know, how all the cells move and develop in a developing organism.
We're creating spatially and temporally resolved maps of inflammation. We're creating, you know, cellular, um, programming and immune cell reprogramming to be able to actually design completely programmable therapies. And you know, we're creating these digital representations at each of these layers so that we can accelerate the science, simulate what's happening, you know, make biological matter and make proteins and cells and genomes programmable.
And I think all of that has to come together, and I think if you have the focus and you build, you know, the biology and the computational layers together so that they're tightly integrated, you know, that's how we're gonna make the fastest progress. For the last 10 years, I think we've been one of the big champions of open science.
You know, we're an organization that both funds and builds. And in our funding we've always supported open science, and in our building, you know, we've always done open science. So that's something that's gonna continue. It's just really fundamental. We're not a drug development company.
Mm-hmm.
We're not trying to generate therapies. We're trying to build the technology that moves science forward.
So I think Mark had this concept about if you provide the right tools, then the entire scientific community can leverage them. Yeah, so obviously you believe strongly in protein language modeling as a tool. What is the next most important tool for advancing a general improvement in our ability to tackle human disease?
Cell Complexity46:59
Yeah. So I think the next level of complexity that we have to address is the complexity of the cell.
Mm.
And I mean, this is going to be tremendously hard. There are billions of proteins in a cell-
I'm glad that you say it's tremendously hard.
Yeah.
If you come and say it's gonna be easy-peasy-
Yeah
... I think we would just
Well, I think it's a worthy challenge.
Yeah.
But yeah, I mean, it requires technology that doesn't exist today. It requires new, you know, modeling approaches and probably architectures and ideas in machine learning-
Mm
... that probably don't yet exist. So there's, I think, deep and fundamental problems to solve. But again, I think, you know, you take it step by step. And so, you know, we kind of started at the molecular layer, and we know that that is really fundamental, and we can begin to link that to, you know, observables in-
Mm
... in cellular biology.
I'm really curious 'cause this has been the question that's been on my mind for a long time: we have virtual cell models, we have molecular scale models, and I've seen a few papers about trying to link them. But what are you guys doing? 'Cause it sounds like this is becoming top of mind for you.
So let's maybe make the analogy with protein biology. You know, what I think makes our digital representations of proteins powerful and useful is that they generalize. They're able to make predictions for proteins that are entirely unlike the proteins in their training data. You know, they're able to generalize so that you can design, you know, fundamentally new folds, new binding interfaces, new structures.
So there's this degree of, um, yeah, what we call generalization or generality. Um, in short, they can predict the outcome of an experiment that we haven't already done, that they haven't already been trained on. So for digital representations to be valuable, you know, they've got to be able to be used to answer a new question.
So I think that's the critical thing. So we're not there with cells. I think, you know, the current generation of models that are being called virtual cells are good representations of the underlying data, but, you know, they have a very limited ability to predict what will happen when you make a novel intervention in a novel, unobserved context.
But to be able to answer the fundamental scientific questions about cellular biology, we need a model that can do that. So, you know, our thinking about this starts with that idea: what's it gonna take to get there?
Going back to the protein-protein interaction, the human interactome... If you had that, just, you know, predicting static structures, static structures are in some sense not enough for a lot of understanding biology. Dynamics are probably, for most people, a much more useful tool to have. You can start with static, it can give you some insight, but it's very rarely the full answer.
So you have, you know, a model which is capable of predicting a lot of different proteins. We probably have almost all, we have many of these resolved in the PDB, some of them we don't. Given that the dynamics and interactions are more important, how do you bridge that gap? Because to me, that seems like maybe one of the key steps in going from, like, a really microscopic model of things to something which is closer to a virtual cell.
You actually have to be able to model local interactions of local proteins or RNA or DNA or lipids or, you know, whatever else is floating in the cell. Is that sort of, like, a goal that you would try to bridge? Or maybe I'm misunderstanding. Is there, like, another way you would imagine bridging these two?
I mean, one day it'll probably be possible to have a computer that can kind of simulate the cell from first principles, but we're very far from that, right? I think that's far beyond the reach of current computational technology. I mean, even simulating the physics of the folding of an-
Mm-hmm
... of a single, you know, protein molecule. Basically, we could do it for a few fast-folding proteins-
Yeah. We just can't even really do it for one
... but that's about it.
Yeah, yeah, yeah.
Yeah. So there's kind of this dual, complementary view of biology. One view is that kind of first-principles reduction, you know, where all of biology is explainable in, you know, more basic terms, in basic physical, chemical, biochemical terms. And I think historically there's a long line of research that's really sought to understand biological phenomena and simulate biological phenomena in that way.
And I mean, historically, you know, the field had believed that the solution to the protein folding problem or the protein structure prediction problem would come from, you know, this kind of first-principles simulation, and it really kinda came out of nowhere that, you know, this could be solved using essentially pattern recognition or, you know, this type of machine learning approach.
So I think historically it has been productive to understand biology through information theory, through information. And, you know, you can think about the cell as a computer, as an information processing machine. In informational terms, there are, you know, these very basic principles that link the information encoded in the genome to the genes that are transcribed to the phenotypes of the cell that will result.
And so, you know, if we could model and understand the cell at the level of its underlying programs, you know, that sort of gives, I think, the right abstraction. What do I mean by the right abstraction? Well, I mean the abstraction that is possible today because we're in the era of information theory at scale.
You know, Claude Shannon, you know, had this idea of the ideal predictor of the next character, and he had this really beautiful paper where he tried to compute the entropy of the English language and imagine basically, you know, taking an infinite context and then, you know, what is the entropy of the next character?
And at that time, I mean, I'd say it took a great leap of imagination to imagine that ideal predictor. But today we get closer and closer to being able to build that, and we can do that for text. And so, you know, what would that predictor be for biology?
And that's kind of the idea of ESM. It would learn the underlying structure of all biological phenomena. So if you think about that from the standpoint of the cell, if we can collect enough outputs of cellular biology that we can observe to reveal the underlying programs, patterns, and structure, you know, then we could create kind of the information-theoretic description of the cell, and I think that would be sufficient for understanding disease.
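The quantity described here, the entropy of the next character given its context, can be estimated from counts. The sketch below is a simple empirical n-gram estimator, not Shannon's own procedure and not how a language model is trained. `conditional_entropy` and the sample text are introduced only for illustration, and on a short sample long contexts will underestimate the true entropy because most contexts are seen only once.

```python
import math
from collections import Counter


def conditional_entropy(text: str, context: int) -> float:
    """Empirical entropy, in bits, of the next character given the
    previous `context` characters, estimated from n-gram counts."""
    n_positions = len(text) - context
    if n_positions <= 0:
        raise ValueError("text must be longer than the context")
    # Counts of (context + next character) and of the context alone.
    joint = Counter(text[i:i + context + 1] for i in range(n_positions))
    prefix = Counter(text[i:i + context] for i in range(n_positions))
    entropy = 0.0
    for gram, count in joint.items():
        p_joint = count / n_positions            # P(context, next)
        p_next = count / prefix[gram[:context]]  # P(next | context)
        entropy -= p_joint * math.log2(p_next)
    return entropy


if __name__ == "__main__":
    sample = "the cell is an information processing machine " * 20
    for k in range(4):
        print(k, round(conditional_entropy(sample, k), 3))
```

With `context=0` this reduces to the ordinary entropy of single-character frequencies. Conditioning on more context can only lower the estimate, and the ideal predictor is the limit of that process with unlimited context and unlimited data.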
This reminds me of a lot of the work that happens in signaling pathways right now, right? Where you have a protein in a cascade of different protein-protein interactions that eventually cause a phenotypic change in the cell in some way. How do you translate that into something that can be sort of scaled, or maybe it's something else, but how do you, for example, do that?
Yeah, going back to the bitter lesson.
Data Generation54:44
Yeah. Going back, let's, let's just get back to the bitter lesson.
We need data. I mean, I think, you know, why have these advances in protein biology been possible? They've been possible because of, you know, decades, I mean, for protein structure, half a century-
Mm
... of work to experimentally determine the structure of proteins and, you know, the effort across the scientific world to, you know, sequence genomes and metagenomes. And so that's created, you know, this data set that you can really, you know, train at scale on and really learn these deeper principles.
And so-
But those two different data sets are actually in many ways quite different. Like the PDB, a bunch of very painstakingly-
Yeah. Yeah
... constructed protein structures, many of which were an individual PhD thesis, and then maybe the similar ones that followed came later, which might have been, you know, ten of them for a PhD thesis. People always estimate it's, like, thirteen billion dollars to create the PDB, some, like, very large number. The reason people created the PDB was because each individual protein was independently useful.
Like, people didn't create it for the sake of, we're solving protein structure. They saw that, like, "Oh, this protein, we believe, is involved in this disease pathway. Let's understand this protein so we can target it," and so on. Of course, there's some caveats here, but at a high level, a lot of this genomic data, uh, especially for humans or viruses or bacteria, you know, was sequenced for a very specific reason as well, right?
Well, it's great that these are useful after the fact, but I wonder, now going forward, especially since, you know, with BioHub's, uh, Virtual Biology Initiative of, like, half a billion dollars, I think, um, and I'm sure there will be more large initiatives coming from BioHub in the future, you know, you have the chance to be very specific and deliberate, and now collect data for the sake of solving a problem with ML, rather than depending on a data set which was created for some other purpose.
So given that new opportunity, how do you do things differently? How do you think about data collection, um, to enable science broadly when you have the option of doing basically anything from first principles?
A little bit of context. We announced a few weeks ago the Virtual Biology Initiative. Um, we basically said, you know, we're gonna invest, uh, four hundred million internally in data creation and development of technology to scale data generation, to be able to increase the number of modalities that we can measure simultaneously. We also, um, announced that we're going to commit a hundred million to catalyzing efforts outside of BioHub to generate data.
And so, you know, we think that's a fraction of what's actually needed to do this, right? But, you know, the hope basically is that by making this initial commitment, giving starting funds to some of the groups that are really thinking about this, you know, working to build different core areas of the data that's going to be needed, that's gonna be a catalyst that's gonna galvanize other groups to come in and contribute to this.
So that's what we really hope to see. You know, the idea is that this is a broad-based effort, so it's not just us. So, you know, I can say what my perspective is on what data needs to be generated here or what can be generated. But, you know, we also want to approach this really collaboratively with the scientific community.
And so-
Mm-hmm
... part of this is also hearing from scientists what they want. So from my view, you know, there are a few key principles here. The first is speed, okay? So, you know, it took decades to build the data for proteins, and we can't wait decades. You know, we need to figure out how to do this in a couple of years.
And you look at the rate that, uh, general AI is developing, and, you know, in biology we're gonna be fundamentally limited by experimental science and data. And so we really need to, you know, work to address that gap as quickly as possible. So I think that's one key thing, looking at what are the technologies that we can scale up today to begin to, you know, give this picture of the information architecture of the cell.
So there's speed, and then there's also, uh, the idea of generalization. So kinda going back to what I was saying before, you know, we want models that can serve as oracles for the biology. They can predict an experiment that you haven't done. And so how are we going to be able to do that? We're gonna need to look at a multitude of different interventions in a multitude of different contexts.
And so it's kind of similar to the principle of training a language model on the internet or training a protein language model across all of evolutionary diversity. What does that look like for cellular biology? And so we have to scale interventional biology, so that looks like things like perturbation biology, Perturb-seq, measurements where we can look at combined transcription, imaging, other layers of the cellular information hierarchy.
And there are, you know, a number of groups. Our teams are working on this. There are a number of groups across the scientific world that are working on problems like this that are ready, I think, to scale. The second is spatial biology, and I think that's gonna be really important. And so that's gonna help us to really understand the cell in context.
And I think, you know, understanding the cell in isolation is, is really not what we need. It's not the goal, right? The cell is part of an incredibly complex system in the body. And, you know, to be able to understand disease, we have to understand how cells interact, the systems that they form, the circuits that they form.
So we need to see that. So spatial biology, I think, is undergoing rapid progress and, you know, is, is an area that's really ready to scale up. That's kind of what can scale now, I think. And, um, BioHub has actually, over the last ten years, really, I think, made kind of pioneering funding commitments in those, those areas.
And so we've, we've, we've funded efforts like the Human Cell Atlas, and we've built, uh, Tabula Sapiens, which built, built large cell atlases, and we've built CellxGene, which is kind of a database of single-cell transcriptomics. And so we're, you know, we're, we're, we're really looking to kind of build on that. And, you know, I, I don't know how many cells there are in kind of the, the largest efforts.
We're probably around like a billion cells or something like that today, so we've got to go, you know, multiple orders of magnitude from that. So, you know, that involves scaling the technologies that we have now, but it also involves, you know, the next generation of technology. So we're also funding and supporting efforts in that area.
And there, you know, we really want to look more at cross-modality. You know, can you simultaneously see the phenotype, observe the transcriptional layer, understand what's happening proteomically, link that to the genome? You know, we'd like to-- And the epi-the, the epigenetic state. You know, we'd like to be able to see all of that, and so really pushing technology to be developed faster that can reveal more of those connections and more of that biology and do that in a more scalable way.
It's interesting because when I hear most of those ideas, they're oftentimes the things that people already think about in terms of scaling biology. What is the next technology that is going to allow for, like, enabling data collection technology? Um, going back to the theme of Bitter Lesson for Biology, you don't have just scaling laws on, you know, compute and parameters, but now the scaling law is probably in data collection in some meaningful sense.
Where are the next big opportunities there? Like, in true, so you're talking about developing new technology as sort of like the, um, as part of the, this initiative.
Yeah.
Yeah.
So, I mean, I, I think it's basically the same things that I'm saying: scaling what we have now-
Yeah
... kind of being able to expand the number of interventions that, that we can-
Yeah
... look at, expand the number of parameters that we can measure-
Mm-hmm
... so really kind of more and more multidimensional measurement, um, and, you know, drive down the cost and all of that. So better gene sequencing, kind of better ways of en- encapsulating cells and being able to measure what's happening, not just the transcriptome, but, but other layers simultaneously.
There's an interesting Pareto frontier there about, uh, if you have a fixed budget, how much time do you spend on improving your assay versus how much do you spend on actually scaling it? You know, where do you, uh, net out there?
Yeah.
Like-
We, we have to do both of those things-
Yeah
... right?
Yeah.
'Cause the-- I think with current technology, you know, we can definitely kind of get, get data at 10x to 100x where it is today-
Mm
... with, like, relatively reasonable investments, you know. But, but then to get another 10x or more here-
Mm-hmm
... that's gonna require, require a lot more technology development.
Yeah.
But the other, the other really big principle is, is going to be feedback.
Mm-hmm.
And so I think that's gonna be really critical.
Mm-hmm.
And I think you can see that as a, a layer of, of technology development that's, that's, that's gonna need to occur, and I think there's a, a lot of great things happening right now in kind of automation, flexible robotics, that's gonna accelerate-
Mm-hmm
... um, where, where that can go.
And the experimental design as well.
Bottlenecks1:03:54
Yeah.
So we typically ask our guests what is a bi- a bottleneck that you would remove that would, you know, sort of unlock things, but we just spent a long time talking about that.
Yeah, I think I answered that question.
So, so-
Yeah
... so, um, but I wanna ask you about it, but I'm gonna give it a spin, which is like, so maybe a little bit outside of your domain, like, so language modeling or supply chain, something that is a bottleneck that is maybe non-obvious and not directly something that you are working on, but that maybe has impact on, on the work of biology or BioHub in particular.
I mean, it's, it's a hard question to answer 'cause there's just so many bottlenecks. I mean, the one that- ... you know, that I always think about is compute, but I think that's a pretty obvious one. It's the bottleneck for, for all of AI in, in many ways right now, but, you know, especially because we're training these large-scale models.
You know, our-- we're, we're always focused on compute, and I think we're, you know, kind of limited both by the data and compute. I think we're in a, you know, we're in a position where I think we, we have, you know, incredible compute resources for a team working in biology. But I think, like, like all teams working in AI right now, really the limit is, is just how much compute power.
Mm-hmm. So if you could 100x your compute, you think that ESM-C would, like, be way better?
I mean, it would definitely be way better. We also need to scale data, so both of those things would-
Yeah
... have to happen in tandem.
Have you basically exhausted what's available right now for-
I don't think so, no.
Okay.
No, I don't think so.
Okay. The large data sets out there, or you could-
Well, there's more-
... relatively, I mean, inf- compute is cheap, right?
More parameters.
Yeah. Yeah.
So, so we trained ESM-C up to six billion parameters.
Yeah. Oh, but I'm saying in terms of data available, like, have you exhausted most of what's publicly available in terms of, like-
No
... genomic?
No, not, not yet. And then, you know, the atlas-
Mm-hmm
... that we just built actually has more sequences and structures-
Yeah
... than ESM-C was trained on. Yeah.
So definitely have a little room to go.
It-- so how-- like what's-- is that an order of magnitude jump or twice as much? Like how does that, how does that work?
Yeah. I mean, I, I think ESM-C is trained on, you know, say, order of a billion sequences.
Mm-hmm. Mm-hmm.
So there's, there's definitely probably order of 100 billion sequences.
This is lar- a lot of them are largely redundant.
Hu- 100 billion?
Yeah.
Yeah.
Oh, okay. To get that billion, you whittled down from six billion, 6.8 billion, right? So of those 100 billion, if you were to similarly cluster and find unique ones, what do you think-- where do you think you would land?
The sequences aren't actually redundant, right? It really depends on what you mean by redundancy.
Mm-hmm.
Because there's, I think, a tremendous amount that you can learn from small genetic variations, right?
Mm-hmm.
'Cause these are, these are really revealing of, you know, kind of the, the very basic determinants of protein structure and function at a very fine level. So I think that, you know, as we think about protein space, you know, having a vast diversity of sequences across a wide range of protein families is, you know, really critical for the emergence of this kind of structure prediction capability because I, I think kind of large diversity is what trains the model to understand to develop a representation of structure.
But I actually think that to develop a representation of function, it's these very small variations that are important.
Mm-hmm.
And so I, I do think that there is probably a lot more. You know, it's, it's like the models haven't yet been trained at that level of kind of just, like, really deep understanding of these very small but critical patterns in, in sequence. I mean, a single-
Right
... a single mutation is enough to destroy the-
Mm-hmm
... the function of a protein.
So you could, you could conceivably actually take all 6.8 billion of those, retrain, everything's the same, but-
Yeah. Yeah
... you do atlas.
No, you could, you could train on more than that. There are even-- I mean, that's kinda clustered down, so.
Yeah.
Maybe the question is how far-- when do you hit the law of diminishing returns here? I mean, it sounds like you have plans for an ESM4 or an EM- ESMD or whatever you wanna call it.
We're always developing the next thing.
Yeah.
Yeah. So yeah, but I- I'm just wondering it's, you know, at some point, is this actually something that you could exhaust? You know, people talk about exhausting the pre-training data, uh, on, on the internet or something.
Yeah, I mean, at, at some point, yeah.
Yeah, yeah.
At some point.
Yeah. I mean, is it-- but it sounds like actually something you could s- conceivably imagine doing in the next few years. Or even if you don't exhaust it, you hit a lot of diminishing returns for, you know, the applications that you're trying to predict here, where maybe your resources are better spent somewhere else.
I mean, it's, it's basically, it's an empirical question, right?
Yeah. Mm-hmm.
It's truly an empirical question, and so we, we just, we just don't know, you know?
It's-
I mean, with ESM2, we weren't sure-
Uh-huh
... 'cause there, there were some diminishing returns.
Mm-hmm.
With the ESM-C, you know, now, now there aren't, right?
Yeah.
So you can kind of look at, look at that, extrapolate from-
Mm-hmm
... from the scaling law there, and, you know, we- there is enough data to train that next, next model, so.
And the other question that we usually ask is any call to action or what, what do you want people to go take action on? If the listeners want to get involved, get hired, build things, what would you ask people to do?
Outro1:08:52
Well, we're, we just announced, or, or I should say we are, we are going, at the time that this, this podcast comes out, we will have announced ESM-C and this, this world model for protein biology. Um, it's gonna be open source. It's, it's gonna be MIT licensed, and we want people to use it. You know, we want, we want this to be a tool that can unlock science.
We're excited to collaborate. We have a team that's, that, that works on that, and we want to hear from people and understand, you know, what, what we can build that can help to accelerate their science.
Awesome.
Yeah, we might have a, um, a demo/paper club of some sort on this channel, so stay tuned.
Yeah, stay tuned for that.
Uh, yeah.
We'll, we'll invite you and your team, um, whoever can make it. We'll feature this paper once it's in final preprint form and spend some time on it for an hour on the Latent Space paper club.
Yeah. Uh, thanks for chatting with us.
Awesome. Yeah, great to meet you guys.