Inside xAI: Building Grok Imagine in 3 Months, Videogen vs World Models, and Video Agents— Ethan He

Intro0:00

Ethan He0:00

I have a pretty big claim. The visual intelligence are actually mostly coming from language. Like, 'cause these video models, especially from now, since the diffusion model technology is more mature, like every time you see there, there's some improvement on these models, I, I would say mostly the, the gain comes from language model, not, not coming from the, the vid- the video model itself, like the, the video research models themselves.

Swyx0:32

Before we get into today's episode, I just have a small message for listeners. Thank you. We would not be able to bring you the AI engineering, science, and entertainment content that you so clearly want if you didn't choose to also click in and tune into our content. We've been approached by sponsors on an almost daily basis, but fortunately, enough of you actually subscribe to us to keep all this sustainable without ads, and we wanna keep it that way.

But I just have one favor to ask all of you. The single most powerful, completely free thing you can do is to click that subscribe button. It's the only thing I'll ever ask of you, and it means absolutely everything to me and my team that works so hard to bring The Latent Space to you each and every week.

If you do it, I promise you, we'll never stop working to make the show even better. Now let's get into it.

From Cosmos to Grok1:22

Swyx1:22

Okay, we're here in the studio with Ethan He, uh, most recently of xAI. Welcome.

Ethan He1:27

Yes, thank you. Glad being here.

Swyx1:28

We're also here with Vibhu. Uh, you were first coming to us or joining the Latent Space world because you were working on Cosmos at NVIDIA, and you did a-

Vibhu1:36

Yeah

Swyx1:36

... great paper. We loved it. Uh, you presented it as well, so thank you for doing that.

Ethan He1:40

Yep. I've actually, I also presented the MOEs-

Vibhu1:44

Yes.

Swyx1:44

Yes

Ethan He1:44

... yeah, twice at Latent Space.

Swyx1:46

Yeah. Yeah. How did you actually hear about us? Did we reach out to you? Is that how it worked?

Ethan He1:50

No, actually, I-- The, the community, like I, I realized, oh, there is this online community-

Swyx1:56

Yeah

Ethan He1:56

... that people talk about AI and also learn, learn from each other through papers e- every, every week through the Paper Club. It's, it's very nice.

Swyx2:06

Yeah. It's, uh-

Ethan He2:06

I learned a lot

Swyx2:06

... I think three years nonstop. We haven't stopped even on Christmas and New Year's. S- ma- many weeks I want to stop, but it keeps going.

Vibhu2:15

No, no, it's good. I think you had posted that you worked on a paper, and I was like, "Oh, very cool. We have Paper Club. Present then."

Swyx2:20

Yeah, yeah.

Vibhu2:21

But I might have reached out to you after.

Swyx2:23

Yeah, y- y- because it's an amateur club, right?

Vibhu2:25

Yeah, yeah.

Swyx2:25

Uh, so it's very un- unusual, and but we have sometimes paper authors come by and, and actually explain the paper. Today we just did, uh, the poolside paper, which was apparently very good.

Vibhu2:35

Came out yesterday.

Ethan He2:36

Nice.

Vibhu2:36

Uh, pretty interesting, right? Fully open. They talk about everything, systems. So it's a good one. We'll, we'll recommend-

Ethan He2:41

Yes

Vibhu2:41

... people to read it.

Swyx2:43

Bring us up to speed on your transition to xAI, 'cause I actually don't even know when you joined. Uh, just like tell the, tell the story about the sort of transition.

Ethan He2:51

Before xAI, I was working on Cosmos world model as in-- at NVIDIA. So Cosmos is a, it's a giant video foundation models that can... that aims to simulate the world, and for-- it serves as a foundation of-- for all of the roboticists to build on top of. There, once I built the Cosmos 1, I realized as this thing also has a scaling law similar to language model, we need to scale, scale up the video models further.

Uh, that's, that's why I realized I need to move to somewhere with much more compute resources. That's how I-

Swyx3:30

Than NVIDIA?

Vibhu3:32

The GPU glitch king themselves.

Swyx3:36

Yeah.

Ethan He3:36

Yeah.

Vibhu3:36

And timeline-wise, when was Cosmo? It was pretty early, right? It was open world model, open paper, everything.

Ethan He3:42

Yeah. It was like, uh, end of twenty twenty-four.

Vibhu3:45

End of twenty twenty-four.

Ethan He3:47

Yeah. Then at, at mid twenty twenty-five, I moved to xAI. At that time, I, I joined about the time when xAI was about to build video models and in multi-model models. There were no, no infra, no data, and no model, and it just-- as a few engineers, we, we built it in three months and released the first model, Grok Imagine 0.9.

And since then, I, I keep working on video models and move more from pre-training and to post-training of the video models. For example, like reference to videos, kind of like the cameo feature and, uh, video extensions. And, uh, and, uh, before I left, I, I worked on a world model, leading a small team to, to focus on the real-time long horizon video generation.

Swyx4:41

Can you give like a rough roadmap of like, okay, you're on a brand-new team. Grok previously was only text, or they partnered with BFL for-

Ethan He4:48

Mm-hmm

Swyx4:49

... uh, their, their image gen stuff. What do you-- What, what are the building blocks, right? You have compute. Data you can procure somewhere. Like, just the, you know-- What, what are like the sequence of things that people should think about when you're setting up a new team?

Vibhu5:00

I mean, actually even deeper, not just data you can procure. You guys had to go through getting the data too, right? So-

Ethan He5:06

Yeah

Vibhu5:06

... you shipped it pretty fast, but yeah, I'll put it-

Swyx5:08

Yeah, three months is like-

Vibhu5:09

From everything

Swyx5:09

... actually like very surprisingly fast.

Ethan He5:13

Yeah, one thing I, I say like thanks to my experience at NVIDIA, 'cause first time when we were building Cosmos together, we built it, uh, for about a year. So, so this is like the second time I do it. Roughly, roughly have an idea like what to do. I say the most important thing is, is the talent.

Everyone, everyone were very strong and clever, ve- very close with each other towards a common goal. So that speed up things a lot. So you reduce the communication bandwidth among people, and everyone can, can work towards the same goal. It's, it's like every day there's not that much meetings on the calendar, like maybe like a, like a, a sync a day, and after that it's, it's just all building.

It was pretty fun at that time. And another thing is that xAI has very strong Foundations of like data, data inference, model inference, and the, the supporting there can, can help the model develop a lot. When I look at like training models, I don't-- Uh, so actually the, the top important thing is like how many, uh, how many iterations can you do like per, per day.

Uh, and the, the more iteration can you do, you can, you can train the model much faster. So if you have very strong infra and you, you have a lot of compute, you can, you can train these models in very short period of time. That can give you a, a much larger buffer to, uh, for errors, and it also gives you the opportunity to spot more bugs.

Uh-

Swyx7:03

Yeah. What, what is an iteration? Is it like a, a few hundred steps or what, what are you-

Ethan He7:07

Let's say just the train- training the model, like from acquire new data and maybe design new algorithms and train, train a new model, maybe at smaller scale or-

Swyx7:18

Yeah. So cycle time for like any hyperparam that you're searching.

Ethan He7:21

Yeah, cycle time and tune-

Swyx7:22

Yeah. Yeah, yeah

Ethan He7:22

... to like eval this model. Is this model better than my previous iteration?

Swyx7:27

Yeah.

Ethan He7:28

So-

Swyx7:28

So it's like before you, someone had already set this up that you can iterate very quickly.

Ethan He7:32

Yeah, I think the, the foundation there is, is extremely good for developing and research, research models. And often I find is this is kind of boring, but like a lot of the improvements does not come from new algorithms. It comes from finding small bugs here and there in the data pipeline, in the, in the model training pipeline.

Those give, those give the biggest boost to the model quality.

Vibhu8:03

It's interesting, right? So you say it's like small team, less communication bandwidth, but also a lot of quality is like find little bugs. It seems counterintuitive, right? You have a lot of people, you can iron out more of those, but it's interesting to see the other side, right?

Ethan He8:16

Yeah. Yeah.

Swyx8:17

I also wonder, have you-- do you try using LLMs to look for bugs? Like, I don't know.

Ethan He8:22

I remember at that time it was mid two thousand and twenty-five, so it's the, the coding model wasn't quite there yet. I remem-em-- I remember like December two thousand and twenty-five it was extremely good. Yeah, I've been, I've been using it at that time. It's, it's helpful. Uh, sometimes it, it produce codes that are kind of difficult to maintain.

Even though like the first time it built something extremely fast, but it gave the, like a spaghetti code, thousands of lines that I couldn't maintain, and the LLM itself couldn't figure out what's, what's wrong and how to improve on top of it. But now I find it much, much, much better. Yeah, I want to bring up another point here is like now coding models are much more efficient and can help, help us implement stuff much faster.

Compute might become a bottleneck again because previously, like i-if you want to train a new model, say you want to generate new synthetic data and then or write a new algorithm, it might take a few weeks, and during that period of time you don't-- you might not have experiments to run. But now you, you can build that thing within a few hours, then you can immediately train a model.

Now you have to have enough compute to, to try all of the ideas. So compute might be the bottleneck of iterating speed again.

Swyx9:53

Mm-hmm. Mm-hmm. Yeah. Um, yeah, I, I actually honestly, I think it's like kind of a stressful job because you're like, "Well, I should be trying everything, and if I'm not, then I'm not doing my job well."

Vibhu10:05

I mean, there's also the stress of you're eating thousands of GPUs per hour, which is very expensive and, you know, compute can go to other researchers.

Swyx10:14

You got the good daddy Elon to-

Vibhu10:14

You got daddy Elon.

Ethan He10:16

Right. It was-

Vibhu10:17

But you know, there's still finite amount of compute. Like you want to use it, you want to use it well, you want more of it.

Ethan He10:23

That was quite stressful indeed. Yeah, I think one, one thing is the-- with coding models now, like a lot of these jobs can be automated, which is much better. A second, it's a, it's a marathon, so you, you got to maintain good health and, uh, a regular schedule.

Vibhu10:45

It's, it's hard to hear that when you shift from zero to nothing in two months.

Swyx10:49

Yeah. I mean, and, uh, I think obviously the, the culture at xAI is very famously, uh, you know, people, people work very hard. Um, o-o-one thing I, I did want to dive into, you know, in our-- in the notes that you, that you sent ahead of time, uh, you had specific comments about the cost of VideoGen training.

Uh, presumably this is on the Colossus-1, right? Uh, the, the two hundred megawatt cluster.

Vibhu11:11

Yeah.

Swyx11:11

And whatever you want to share on that.

Vibhu11:12

I think there's, there's three things we're talking about, right? So there's VideoGen, there's also the ImageGen model that you put out. Do you wanna like complete the-- okay, so zero to one, you have a few months. Just what are the stages of create ImageGen model-

Swyx11:23

Oh, yeah. Maybe I got distracted.

Vibhu11:24

Sorry. Um, and then, you know, from there, there's VideoGen, there's AudioGen. Would love to get into those next. But what is that first few months like? So small team, lot of bugs, iterations, but like, you know, what, what does it look like? Do we take something off the shelf? Do we just get data compute? What's, what's the few months like?

How do you go to state-of-the-art ImageGen model? How do you just start?

Building Video Models11:41

Ethan He11:45

Yeah. I cannot comment specifically-

Vibhu11:47

Yeah

Ethan He11:47

... how xAI did, but it's, it's a quite standard process. I can draw, draw some, uh, ex-examples from Cosmos. So mainly it's like, uh, building, building a video model, you actually need to build a image model first. And building, building these two models, the data you need is 100% synthetic pair of language and image or language to video.

Because on the, on the internet, actually the, the videos don't naturally associate with text. So you can say, oh, like on YouTube, you have the title and you have the description and the comments-

Swyx12:28

That's all

Ethan He12:28

... of, of a video, but usually they're not relevant to, to the video itself. And say maybe like the video is a natural scene of mountains or something, and the title is, uh, like, uh, I'm so happy today. So they have n- they have no, no correlation at all. So the first step is to, you have to generate synthetic pair of language with, uh, videos.

So you gather videos from the internet, and you, you use a VLM to caption the videos. So that part, here's a question, like how do you, how do you gather VLM to begin this? So like if there's no-

Swyx13:12

You, you fuse the model, right? Like-

Ethan He13:14

Say if there's no like VLM exists, like how do you generate the, the text to the beginning, right? It's, it's impossible.

Swyx13:21

I see.

Ethan He13:22

In the beginning, it's like you ask human to describe the, the video as, as detailed as possible. For example, you ask them to describe everything, like all, all objects, all characters, and all interaction and dialogues in the, in the videos. So that's in the protocol of Cosmos labeling. They require the, uh, the objective they give to the labelers was that you have to describe the video as detailed as possible such that a blind person hears a blob of text can reconstruct what the video is like from, from their head.

Swyx14:00

Video or image? You're talking about images.

Ethan He14:01

Video or image. Either, either one of them.

Swyx14:04

Okay.

Vibhu14:04

This was pretty common when we went from like, uh, CLIP and DALL-E, right?

Ethan He14:08

Yes.

Vibhu14:08

It's all training on really detailed captioning of images. So same-

Ethan He14:12

Yes

Vibhu14:12

... is applied to video, but instead-

Ethan He14:14

Yes, same apply

Vibhu14:14

... of using multimodal model to pass in video images and write rich descriptions, you can also...

Swyx14:20

I mean-

Ethan He14:21

Yes

Swyx14:21

... I, I think that's the traditional perspective of supervised, uh, or, you know, very highly human curated thing. I feel like there's a unlock with unsupervised, right? Where like you, you have enough to bootstrap that you can just throw common corpus on it or, you know, whatever. Uh, like u- unsupervised vision and language pairing, right? Like where you just have, uh, uh, interspersed image and text and it just learns.

To me, that is the VLM breakthrough that is different from the CLIP, different from the, the, the, the pre-LM era.

Ethan He14:53

Yeah, yeah. It's interesting to see that you kind of need both data.

Swyx14:58

Yes.

Ethan He14:58

For example, for-

Swyx14:58

You needed to bootstrap it up. Yeah

Ethan He15:00

... yeah, for the generative model training, there's also usually like a small percentage of unlabeled data. So, so the model is instructed to generate a video without any text instruction. That, that can also help the model generalize. So after, after this stage of generate the synthetic pair, so, uh, one, uh, one important common step is to train a compressor or a tokenizer of the image or videos.

So because, uh, if you train-- If you can technically, theoretically train image or video models on pure pixels, but the, the problem is that the-- it's, it's a lot of tokens. So like one, one image, like, uh, it's 1,000 by 1,000, it's like 1 million tokens, 1 million pixels. It's impossible to train transformer on that. So it's, uh, you need to train a tokenizer which can go from image to latent space and latent space back to image.

Swyx16:02

That's why we named the podcast.

Vibhu16:04

Exactly.

Swyx16:05

But, uh, basically, you're talking about vocabulary size.

Ethan He16:07

Yeah, so, so like-

Swyx16:09

And so like what is, what is imp-- Like a million is impossible?

Ethan He16:11

In generative models, the, the vocab is continuous. It's a continuous space. We can think about like you map an image to a vector. It's a, it's a fixed length vector. It's like, uh, 16 or 48, something like that. And then you, you map that vector back to, to the image space. And the, the mapping is, uh, has-- The mapping is patch-based.

So you say you have a 16 by 16 patch-

Swyx16:40

Yeah

Ethan He16:40

... and you match-- you map that patch of pixels into this-

Swyx16:45

Yeah

Ethan He16:45

... latent space.

Swyx16:46

We've covered this in the vision transformer.

Vibhu16:49

This is what like VAEs-

Ethan He16:50

Yeah, VAEs

Vibhu16:51

... you, you basically compress your input. You do your generation, you're reasoning all that generation in smaller dimension, and then you project back out.

Ethan He16:59

Yeah.

Swyx17:00

VAE is a form compression, but I think the-

Ethan He17:02

Yeah

Swyx17:02

... the, for me, the patching thing is from VIT, right?

Ethan He17:05

Yeah. You can make both.

Swyx17:06

Literally the, the, yeah, the, the paper is titled, like 16 by 16 is all you need. Uh, something like that.

Ethan He17:12

Yeah.

Swyx17:12

Um, and then I think also, uh, people make a lot of comparisons with this kind of patching with convolutions.

Ethan He17:18

Yes, yes.

Swyx17:19

Which is you're, you're kind of re- reconstructing the old paradigm with the new.

Ethan He17:22

Yeah. Actually, in VAEs, there are, there are both convolution networks and transformers. You c- you can actually do both.

Swyx17:30

Yeah.

Ethan He17:31

After this VAE, so what you got is you've got latent space tokens, and you've got the, the language tokens. So now the training, training of the diffusion transformer, usually generative models use diffusion transformers. It is actually quite standard. It's, it's very similar to how you train, uh, language transformer models. It's not that much difference. It's just the tokens, the, the visual tokens in, visual tokens out.

The only difference is there's a denoising process. So you train the model to- Unmask some of the noise. So you, you add, you add random noise to the visual tokens, and then you train the model to remove those noise to, to generate the clean tokens. And in inference, the model can iteratively remove noise from 100% noise.

Swyx18:29

Yeah. And then there's also, uh, to speed things along on the tech, tech tree of diffusion, there's CFG, like, uh-

Ethan He18:37

Yes

Swyx18:37

... uh, and then there's, there's also, I guess, latent diffusion that, uh, you know, is, is somewhere in there. I think, uh, somewhere along the line, uh, uh, obviously, like Stability and all these other guys, uh, pioneered a lot of this, like, um, architecture. I don't know if you want to get into that or just, or do the, the video side.

Up to you.

Ethan He18:54

After you train such model, such image model, the reason it's a, it's a foundation for video models is that image, image models are cheaper to train, and they have much denser connection between language and text. So, uh, sorry, language and images. For example, you, you train a billion, you train a billion images, and there's a mapping from, from the text to, to the image, and the cost to train the same, like the, a billion, a billion text to a billion videos, that, that's much more expensive because videos naturally have more tokens than images.

Because the diffusion models, their understanding of, uh, language purely come from this, this mapping. So if you don't have enough mapping, so if you only train on like a 10 million videos or something, there-- you might not see enough language tokens in your training, so your model does not understand human intention enough. So that's why you, you really, you train, you first train this image diffusion models, and then you bootstrap the video model from there.

Swyx20:11

One thing I did want to ask, because I-- actually, I think you're, you're the first per-- video model person I've ever talked to, I think. Uh, we, we've, we've like talked to Luma and, and all those folks. There, there's all these tricks in video compression where basically frame by frame there's not that much difference, so actually you don't have to regenerate or resave the whole frame, right?

Uh, but I think MP4 compression or something else like that.

Ethan He20:33

Mm-hmm.

Swyx20:33

Uh, is it tempting to use that? Or as far as I can tell, everyone just treats it as, "No, we would just generate every frame." Is that roughly the state-of-the-art?

Ethan He20:44

There are a few different approaches. Let's say first, like you, you want to just directly use MP4 compression and use, use that as the tokens for the transformers to train, right? So people actually have tried that, but the, the main challenge is the latent space for the MP4 tokens were not, were not very comprehensible for the models.

It's, it's extremely hard to train on that. And there's a-- So that's why they created VAEs, which creates more continuous, uh, latent space, so the models can understand the latent space and learn from it much easier. Even within the VAEs, there are different difficulties of the latent space. So you, you can imagine something the, the simplest, the most naive VAE is like you, you have an image and you just shuffle all of the images into a, into a vector.

So you don't need to train any VAEs, right? But that latent space is extremely hard for- ... models to train on top of. That, that's why there's some debate on like how do you compress the, the tokens. So, so you mentioned like you can compress frame by frame. Also, you can compress, uh, the temporal dimension.

Swyx22:09

Yes.

Ethan He22:09

The difference is if you compress the temporal dimension, you, you get a much higher compression rate because there, there's temporal redundancy between frames because, uh, this frame and the last frame, likely they are mostly similar, so there's only some small difference. Uh, for example, like, uh, I think in 1.2.1 VAE, they have like a eight by eight by four compression rate.

So the, the four temporal tokens are compressed into, into one tokens. That can save a lot of, save, save a lot of the context length. If you do it frame by frame, you have to do maybe like eight by eight by one. Your context length will be four times larger. That being said, the benefit of the frame, per frame compression, we might come back to this later, is, uh, real-timeness and interactivity.

'Cause if you, if you strain the output of the model, uh, frame by frame, you can-- uh, the model can respond to any user request immediately. So if you have like a temporal four, uh, four compression, four times compression, then-

Swyx23:23

It might be laggy

Ethan He23:24

... yeah, there, there's a lag there in nature.

Swyx23:27

So you're very pilled on this. Uh, let's just go ahead and bring it up 'cause we have the visual prepared anyway. There's some frontier applications of realtime video gen. So Flipbook is one of the examples that went viral recently, right? What is Flipbook?

Ethan He23:40

Flipbook is, is kind of like a web brow- web browser. You can see like it has the web bro- browser UI on top. The difference is all of the UIs are generated by generative image model in real time. And anything here are, are fake. But you can, you can explore inside this wor- this imaginary world. Say Like we-- here we have engineering the Great Pyramid.

Like the, the model generates this for us to understand how it works, and if we want to navigate around and understand further, we can click on some of the, uh, some of the description here. And the model will generate a new page, new subpage describing the details we want to know about.

Swyx24:31

So it's basically a kind of we're playing a video, uh, but it's pausing for our next interaction, and then it just plays the next thing based on our interaction.

Ethan He24:40

Yeah.

Swyx24:40

Which is kind of cool.

Ethan He24:41

Yes.

Vibhu24:42

Yeah, and you, you kind of decide your story. So this was, you know, how do you make a pyramid? Uh, levering technique seemed interesting, right? It shows how do you take... Okay, I wanna know what is this-

Swyx24:53

The de- the demo tweet had more animation between frames.

Vibhu24:55

I think it's just skipping, um-

Swyx24:56

No, it's just skipping a lot of frames.

Ethan He24:57

Yeah, they also have a video mode, but, uh, I guess a lot of people are using it.

Swyx25:02

Yeah, it's here. It's, it's-

Ethan He25:02

So yeah.

Vibhu25:03

There's a live video stream. We can try, um-

Swyx25:07

Yeah. So, so this is an example of the kind of future that you see at the extreme. We don't-- we're obviously not in it today.

Ethan He25:13

Yeah.

Swyx25:13

But in a world where inference is completely free-

Ethan He25:15

Yeah

Swyx25:16

... this is better than generating code in text.

Ethan He25:19

Yeah, so- ... this is, this is a final state of where we will be at for word model, I think. Imagine, imagine internet doesn't exist, and then you type in google.com. Like what should, where should, what should a model show you? Um, the model can imagine something, and this is what the model imagine. And these web pages, they completely do not exist.

So I think as the inference costs come down, we are going to have generative UI for everything. If you think about how the coding model works, so they, they write code for a web page, and they render the code might be con- converted into binary, and the binary render the pixels on the screen. So we... in machine learning, every time we have some breakthrough, obviously it's, it's more intuit.

So why don't we have like user instruction to the pixel directly? So the generative UI will be user intention to, to the pixels directly. And say like even if I want email, let's say everyone, everyone have the same interface, but I want, I want it slightly different. I want the email to show, show to me like a TikTok, so I can swipe left and right for the emails.

And or maybe you want something else. We can have completely different things. Or like I have... I'm looking at, uh, Instagram stories, and I don't like the like button. I always misclick it, and, uh, generate the UI without it. So it's going to be a revolutionary replacement of the interface. So in the future, we might have much more powerful LLMs and coding models running behind the scene, and in the, in the front end, the diffusion model will actually be the front end to show stuff to you.

That's how I imagine it.

Swyx27:19

Yeah. Diffusion front end, deterministic back end.

Ethan He27:21

Yes.

Swyx27:22

Something like that. I find that very expensive, but, uh, you know.

Vibhu27:25

I find it interesting you called LLMs writing code on the back end deterministic, but okay.

Swyx27:31

Yeah, you write it once-

Vibhu27:32

Compare, compare-

Swyx27:33

... and then you execute.

Ethan He27:34

If, if you think about the cost, say, let's say H100 costs one dollar per hour, and if you use this eight hours a, a, a day and thirty days, so, um, every month you are paying this two forty, you, you'll actually not, not wanna pay for that. That's even more expensive than Cloud Code Max. But i- if you think about the, the compute costs come down like two times every year, and I think the, the future will actually arrive like within a few years.

Vibhu28:06

It's interesting. Com- compute cost comes down, compute gets faster, model gets smarter-

Ethan He28:11

More efficient

Vibhu28:11

... model gets smaller.

Swyx28:12

Yeah, I don't know why you say two times because I think it's like a hundred times. In language models, it is roughly one hundred to a thousand times every twelve to eighteen months, uh, for the same given level of LMSYS, uh, ELO.

Vibhu28:25

That, that's a net of everything, right? That's model performance alongside compute. So different than just compute costs come down. But, um, you know, a very interesting future.

Swyx28:36

Yeah.

Vibhu28:36

Um-

Swyx28:37

So the, the web designers will have to shout out that accessibility is an issue, right? Like, you know, how do you deal with screen readers or whatever. But yes, uh, this is higher bandwidth storytelling than anything you can possibly generate with code, right? So I think that's the, the rough idea.

Ethan He28:51

And I'd like to add a little bit that so human naturally have the maximum bandwidth when, when we are looking at things, look at videos, and, and we also have maximum output bandwidth when, when we are talking. So in the future, it might be something like we, we talk to AI models, and the AI model responds back with a generative UI.

So that, that would be the maximum input and output bandwidth to interact with AI models before Neuralink happens.

Vibhu29:23

And I mean, it's also very custom, right? Some people are very visual, some people are not as visual, right? They prefer the text. But the best thing about generative UI, right, it can also be text.

Swyx29:33

Yes. There's another project that we w- wanted to highlight, which is the Neural OS. Kinda similar idea, but here you're literally operating, uh, uh, simulating an operating system with a video model.

Ethan He29:44

Yes.

Swyx29:44

Um, and you can play Doom, you can do Firefox. Uh, I find this like mildly less impressive obviously because it's an OS that I can run.

But here everything is imagined. Um-

Vibhu29:57

I, I was, you know, used to the Command+W to close the Firefox tab. I-- that didn't crash. That's actually-

Swyx30:03

It's too, too immersive.

Vibhu30:03

It's, it's too immersive for me.

Swyx30:04

Too immersive.

Vibhu30:05

I wanted to close the tab.

Swyx30:06

Yeah.

Vibhu30:06

But yes, I can play generated diffuse-

Swyx30:08

You know, this is shockingly fast

Vibhu30:10

Yeah

Swyx30:11

Because I, I, I remember there was a demo about, like maybe one to two years ago, someone tried to do the first-person shooter with a image model. There was no consistency. It was very slow. But here it looks like realistically, it's-- this is Doom.

Vibhu30:24

I mean, I think there's two sides to that, right? There's like, okay, what is running a game? The heavy part of it is actually the game engine, all the lighting, all that stuff, the graphics. This is just kind of video, right? Like we've solved consistency. This is still, you know, it looks like a few years old image generation.

There's some temporal consistency, but it's, it's kind of just images stitched together as frame video. But it, it's a good visual representation to pi- to picture the future you want to see, right? Like that's, that's what I see in these more so.

Ethan He30:55

This reminds me of how, how the video models gets better and better. So Neural OS is kind of... I- if you just look at it, it feels like it's just a, a, a crappy version of the, like the, the windows we could have, right? And, uh, but, but the difference is, so the model-- this model is overfitted on the, the existing operating systems.

It, it can generate nothing different than that. But it's actually also similar to video models. So when we are training these video model, image model, we train them on internet. There's no imaginary supernatural stuff on the internet. But once we train this model, you can prompt the model to generate something supernatural that have never existed-

Swyx31:44

Yeah

Ethan He31:44

...in the dataset. So if you train your Neural OS or neural computer on the standard screen recordings on the entire internet, the model can imagine completely new interface to, to interact with the computer.

Swyx32:00

Yeah. This is one of those things that is magical to me. Uh, usually generalizing out of distribution is bad, but somehow we have learned some kind of internal world model-

Ethan He32:11

Yes

Swyx32:12

...that you say, you know, this plus, but it looks like rainbows and butterflies. It'll do it, and it will m- kind of make sense.

Ethan He32:20

Yeah.

Swyx32:22

So yeah, that's kind of cool. Yeah, I, I don't know if there's any comment more, more on there. I, I, I do, I do wanted to-- I did wanted to touch a little bit more on the model architecture stuff, which I think you were getting. It's, like really fascinating. We, we don't get a chance to talk about this enough.

So one of the papers that we covered, we've covered every annual, uh, Segment Anything release, uh, and I don't know if you follow. I mean, you're a computer vision guy, so you-

Vibhu32:43

Yeah, I know

Swyx32:44

...you know. So they, they did memory attention, which is kind of interesting. And I always think, like anything where you can, across the temporal dimension, keep some consistency, um, I think it's like very fascinating. And I don't know if... Basically, like does that-- the CV side bleeding into VideoGen side, I think is underexplored, right? Like we talk about it for labeling, but actually you can borrow the architecture itself.

Vibhu33:07

And there's, there's also complete different approaches, right? Like you, you brought up the term world model, so we went from video model to world model. There is diffusion, but there's also other approaches that people are doing. So maybe we get into those after as well, you know.

Swyx33:21

Yeah, yeah. He has a whole definition of world models-

Vibhu33:22

Okay. Okay

Swyx33:23

...and stuff. I, I feel like we threw a lot at you. Whatever you want to comment on.

Vibhu33:27

I, I think one thing that we should actually comment back on is like, okay, so we were talking about the steps to train ImageGen to video model. One thing we don't see as much of is like, okay, you brought up the delta in training data, right? So you won't have as much a video model might not generalize, but what is the cost of training a large video model?

So we know for-

The Cost of Training33:47

Swyx33:48

Mm-hmm

Vibhu33:48

...LLMs roughly, okay, uh, even like the poolside thing that came out today, right? It's a Gemma level model trained on roughly forty trillion tokens at this many H200s over this much time, right? You can see what is the exact cost of that. So how many GPU hours over how much H200 cost? So how, how do we do the backend math of, you know-

Ethan He34:07

Mm-hmm

Vibhu34:07

...same thing for video models, image models. How do you, how do you kind of break that down?

Ethan He34:11

I can share some back-of-the-envelope calculation. So surprisingly, video models is like the, the cost is very-- is comparable to language models and obviously the largest scale is language model, maybe like a medium scale to language models. I said just storing the videos alone, it, it costs a lot. You can, you can maybe look up on AWS or something.

Swyx34:37

Mm.

Ethan He34:38

You really like, say if you have a billion videos and let's say, let's just say like each video, like five megabyte, then you need like, uh, five petabyte to just store those videos. And also remember we talk about you use a VAE to compress the videos, and you also need to store, typically you need to store those continuous feature on-- also in your storage.

That's also comparable size with the videos themselves. So just storing, storing these videos and the features is, is tens of petabytes alone. And, uh-

Swyx35:15

I, I just, I just looked up the calculation. Five petabytes on S3 Standard is one hundred K per month.

Ethan He35:20

Okay.

Vibhu35:21

It's double. And you need-

Ethan He35:23

And then like tens of petabyte is two hundred K. And even more expensive is you have the ingress and egress.

Swyx35:30

Oh, yeah.

Ethan He35:31

Like you store it in the internet. You have to just to download those videos, I believe it's, it's more expensive on AWS than just storing those videos.

Vibhu35:42

Storing, yeah.

Ethan He35:42

And each training runs, you probably need to pull them once. If you train multiple times, it's, it's even more than that. So, so it's like just storing, storing the network, those costs is, is just, uh... I guess it would be a few, a few millions per month to just storing everything, not to mention the GPU cost. Yeah.

Vibhu36:02

Okay, my, my side tangent, like, you know, the compute rental, like GPU rental is very efficient. There's one side, okay, you can be xAI and build your data center. Should we not just build our, like storage compute as well? Like-

Ethan He36:14

Of course

Vibhu36:14

...cloud cost compared to just, you know-

Ethan He36:16

You save so much

Vibhu36:17

...store. Yeah, exactly.

Ethan He36:18

Yeah.

Vibhu36:18

Especially with like egress and stuff. So, you know.

Ethan He36:21

That's a good idea, but it also comes, comes to there, there are some of its own challenges.

Swyx36:26

Of course, of course.

Ethan He36:27

Yeah, like people who build the GPU data centers, they might not expect this much, uh, uh, storage. And yeah, people build storage, typically they just build it somewhere with just CPUs.

Swyx36:40

I just looked it up. Five-- uh, AWS only charges for egress, not ingress. Tier five for five petabytes is two hundred and thirty K.

Ethan He36:48

Yeah. Yeah, even more expensive than storage.

Swyx36:51

But storing is per month, right? You check in, then you cannot check out. Uh, so it's so cool. It's okay. So, so that's that side.

Ethan He36:58

So, so the TLDR, you know, my, my backhand math-

Swyx37:00

Data, data is larger than you think. Yes.

Ethan He37:01

Yeah, my backhand math of GPU hours times GPU cost is also very much... You know, I'm missing some storage.

Swyx37:06

You're also-- you're basically like also more IO bound than normal training.

Ethan He37:12

Yes. Yes.

Swyx37:13

'Cause like data loading, so caching everything, it becomes super important.

Ethan He37:17

Yeah. So in Cosmos, we did a lot of optimizations to make it not IO bound, so. Um, yeah. Uh, speaking of the training, actually training the models at GPU cost, if you look up like the-- so open source model, how big these video models are, think like LTX has nineteen B parameters. That's a dense model. And people are also exploring, uh, MOEs, so it might be like, uh, twenty B active and, uh, like a hun-hundreds B, uh, total.

So that's, that's even-- That's similar size as medium-sized LLM models. And if you, if you look at number of tokens, uh, we disclose that in Cosmos, it's also like tens of trillions of tokens on the visual tokens. So putting this together, the cost of like training these video models, it's actually comparable with LLMs. Not to mention the, the infra is slightly different from LLM, so it might be less efficient to train these models.

Faster Inference38:21

Swyx38:21

Do you get the benefits of traditional diffusion speed up? So-

Ethan He38:26

Mm-hmm

Swyx38:26

... for, you know, images, there's LCM, LoRAs for, um, you know, fine-tuning. There's, there's a lot of stuff that's been-

Ethan He38:32

Flow matching.

Swyx38:33

Yeah, there's flow matching. There's a lot of stuff that's been done. Uh, there's some overlap that applies to diffusion on the inference side and stuff or?

Ethan He38:40

Yeah. So, so the difference-- The inference side is a completely different story.

Swyx38:45

Yeah.

Ethan He38:45

I think for the training side, it might be a little bit hard to reduce that cost. And for, for the inference side, the biggest gain is from the distillation of these models. You, you can-- It's called s-step distillation, slightly different from knowledge distillation in LLMs. So you-- Typically, for flow matching models, you need like a hundred steps or something.

Like a diffusion model even need, need even more, like a thousand steps to, to generate a good image or video. A step distillation is try to learn to generate your step from the model itself. It's kind of like now we-- you use the full model to generate in a hundred steps, and then you, you take a model that only generate ten steps, and let that models learn from the perfect one.

Swyx39:41

Yeah.

Ethan He39:42

Uh, why this, this work-

Swyx39:44

Strong to weak seemingly.

Ethan He39:45

It's, it's kind of-

Swyx39:46

It's not distillation

Ethan He39:46

... kind of like strong to weak. I guess the-- from the modeling perspective, the strong model-- the teacher model is trying to model the image and videos of inter-internet, and that distribution is extremely complex. But the step distill model is just trying to learn from the teacher. The teacher is, is a model, and the size is fixed, as the distribution is much simpler than the whole internet.

That's the intuition I have why step distillation can work. So usually these models serve in productions, they only run in a few steps. In, in Cosmos, I believe we have, we have like four step and eight steps. If you do some simpler task, like, uh, image-to-image translation, it can even run in first step, like one step in, in Cosmos Transfer.

Swyx40:39

Yeah. Uh, I think this is the same intuition that guides a lot of the consistency model work. I, I sent you a, a link for, uh, SCM. I don't know if you-

Ethan He40:47

Yeah

Swyx40:47

... uh, covered that. To me, that was actually one of, like, the most impressive papers I've ever seen from OpenAI.

Ethan He40:51

Mm-hmm.

Swyx40:52

That like, uh, this is the unifying grand concept of consistency models. I don't know if you have any comments on this.

Ethan He40:58

So, so there are, there are a few different approaches, like, uh-

Swyx41:03

Oh, yeah, yeah. Here it is.

Ethan He41:03

Yeah.

Swyx41:04

Two steps versus twenty or a hundred steps, whatever. It's already done.

Ethan He41:09

So there are, there are a few different approaches, for example, consistency model, and there are also... Actually, we, we shouldn't forget GAN. So-

Swyx41:18

Yeah

Ethan He41:19

... GAN, actually that, that was, uh, that was the OG of-

Swyx41:22

Ethan He41:22

... the step distillation 'cause it, it trained just one step to begin with. So actually, a lot of, uh-- for example, there's a distribution matching distillation which use, which uses GAN, um, as, like, as one of the laws for distillation. It-- GAN just tells you, "Hey, like, generate a image," and then it has a discriminator to, to tell, is this image real or not?

So the model, the model just need to learn one of the distribution, not, not the full distribution. Because in training, the model is asked to reconstruct the ground truth image from the internet, which is extremely hard. And in-- when you're training GAN, it's a one-step process. It's just a, "Hey, you generate image. Does this image look, look as real as the, the image from the internet?"

Which is a much simpler task. And, yeah, combining a lot of these approaches together, people typically do that, like consistency model and distribution matching and, and GAN, and we can get these few step models.

Swyx42:38

Okay. Then there's one step I wanted to add, which is audio-

Ethan He42:41

Yes. Yeah

Audio-Video Generation42:42

Swyx42:42

... and video.

Ethan He42:43

Yeah, so, uh, Grok, Grok Imagine 0.9, I, I believe it's, uh, is a first, uh, first audio-video transmodel deployed at a large scale. So-

Swyx42:56

And that was your first model?

Ethan He42:57

Yes, that was, uh, Grok Imagine's first model. It's, it's audio-video, uh, joints generation. I think the, the hard part is, like, the, the modality alignment, 'cause before this transmodel, like we have, we have text to video alignment. We, we have this, uh, correspondence between text and video. Typically mo-most of the VLMs, they, they understand images and videos.

Uh, videos very well, and they, they don't understand audio mostly. And if you look at the audio generation on the LM side, you can talk to them perfectly fine, but if you ask them to sing a song or something, it typically is not very good. Also, they don't have, they don't have music either.

Swyx43:48

Mm.

Ethan He43:48

The hard part is that, uh, actually audio has two component. It has like a, a discrete component, a continuous component. The discrete component is like the, the language. So when we speak, uh, it's just, uh, some-

Swyx44:04

It's an ASR issue. Yeah.

Ethan He44:06

Yeah. It's, it's text token with some characteristics, I, I would say. But, uh, but music-

Swyx44:13

I think the speech guys would disagree with this, like disfluencies and then, you know.

Vibhu44:17

Yeah, tools should get angry.

Ethan He44:18

I say largely. But the, the mu- but the music is completely different. It's, it's very continuous, and you cannot model them like discrete tokens in language models. Uh, this, this is like the, the hard part for, for models is, uh, not to mention we have to align text, video, and audio together.

Swyx44:42

Yeah.

Ethan He44:43

So-

Swyx44:43

How?

Ethan He44:45

So significant-- Some significant challenges are like-- So, so first, like we, we talk about as the VLMs, they, they cannot understand, uh, most of them cannot understand audio.

Swyx44:56

Yeah.

Ethan He44:56

So you have to have some way to, to do the synthetic data generation for, for audio. You have to caption the model, and that involve, that involve society data and human data effort a lot. Uh, s- And not just surprisingly, most of the LLMs are very bad at recognizing, um, like the, the beat, tone, and the details of the-

Swyx45:23

Mm-hmm

Ethan He45:23

... of music. They, they can, they can give some general, uh, general prediction of which song is this, but it's very hard to describe the details of, of the music. Um, like we mentioned in image generation, like you have to describe image as detailed as possible so that someone blind can reconstruct that. So here is like someone-

Swyx45:49

Deaf

Ethan He45:49

... someone deaf can reconstruct how the music sounds like without, without actually listening to it. Maybe like, uh, you, you can think of it, it need to have, have the-- or, or they call the, the script.

Vibhu46:05

Subtitles, yeah.

Ethan He46:06

You gotta have, have all the details of, of the music, uh, and the dialogue.

Vibhu46:12

So i-is the challenge there typically stuff like music and audio, or is it just... Like is there a baseline? Okay, there's enough data where we can understand, you know, narration, conversation, but there's nuances in audio that that's where you hit all the data issues? Or is it just from stage zero, you just do it all right?

Ethan He46:32

So one important thing is like the alignment. So the model, the model has to know like the video and audio, the, uh, it has to have a time-based alignment, like at which time step the video and the audio token correspond to each other. We actually don't have this kind of alignment for, for, for most of the other modalities.

If you think about like text, text and image, text and video, they are loosely aligned. So you can, you can have a description of what's going on in the video, but you don't have to exactly, uh... You typically don't have exact description or at, at, uh, time step one second-

Vibhu47:18

Yeah

Ethan He47:18

... like what happened-

Vibhu47:19

It's very-

Ethan He47:20

... at time step two second.

Vibhu47:20

... coarse. Yeah.

Ethan He47:22

Yeah.

Swyx47:22

So what, what was the ideal time step that you have to update it, and then it's like four seconds or something?

Ethan He47:27

So that comes down to how you design the model-

Swyx47:30

Yeah

Ethan He47:30

... to-- for the model to, to be aware of as a time, as a time modality. So the model is like a time aware. And that's something pretty unique if you think about LLMs. So if you ask LLM to complete a task, say they, uh, you ask them, a-and they will say, "Oh, this tax- task will probably take twelve hours to complete."

And they, they, they come back in one hour, say, "I've already spent two days on this- ... and I've exhausted everything." Yeah, so, so the LLMs them-themselves, they, they don't have a, a sense of time there.

Vibhu48:10

I actually don't think that's just them not having a sense of time. I think it's somewhat based, right? Like-

Ethan He48:15

Mm-hmm

Vibhu48:15

... you tell someone, "Okay, go work on this feature. Go implement this," there's a general understanding you would have of how long that would take without LLMs working at LLM speed, right? So you think back like two years ago, if I tell you to like build me like a new front end for latent space, have a search bar, have all this, you'll estimate that it'll take a few days, right?

Ethan He48:36

Yeah.

Vibhu48:36

So you tell an LLM, "Go build this." It'll take me a few days. But you know, uh, I think it's somewhat grounded as opposed to them not. Having the best-- not saying that they have a great understanding, but I think that example is like you can see where it comes from, right? You're trained on all over the text.

Swyx48:52

They-they're, uh, trying to estimate what a human would say.

Vibhu48:54

Yes, because that's what the, that's what the data kind of represents. It's not them-

Ethan He48:59

It came from the cor-

Vibhu48:59

Yes

Ethan He48:59

... corpus on the internet. People have a estimate distribution of time.

Vibhu49:02

Yeah, and not, not even just in direct like training samples, right? Just your world understanding of tokens of how long stuff takes, right? Go read a book. It'll take you a while, right?

Swyx49:13

Yeah, yeah.

Vibhu49:13

Even if you do nothing but read a book, it takes a few days. So yeah-

Swyx49:16

Mm-hmm

Vibhu49:17

... I'll admit it took me a few hours.

Swyx49:18

Yeah.

Vibhu49:18

It'll take me a few hours to go through this research. But this is a tangent.

Swyx49:22

Somewhat, somewhat a c-- yeah.

Vibhu49:23

Yeah.

Swyx49:23

This is a train of thought I haven't really expressed until now, is which is basically like a full world model must also be recursive, meaning that the participant in the world model must also be aware that they have a world model. Uh , which is like this whole- ... recursive thing down the, down the line. Um, but yes, uh, and, and, and that the world model can be wrong, and that they need to update it and blah, blah, blah.

Yeah. We-we've, uh, argued this on the, uh, newsletter as well, that there needs to be sort of recursive or adversarial world models.

World Models49:51

Vibhu49:51

Okay.

Ethan He49:51

Yeah.

Vibhu49:51

I mean, just, you know, to ask, how do you define world model?

Swyx49:55

Oh, yeah. Let's go there .

Ethan He49:57

Yeah. So-

Vibhu49:57

So just for context, you know, we talked about, uh, video generation, and then there's a... If you say there's a distinction between world models, um, what's your, what's your definition? How do you see the two?

Ethan He50:10

Yeah. So, so disclaimer, I'm not going to debate like what is world model.

Vibhu50:15

Yeah.

Ethan He50:15

Like there, there are many definitions. So I'll just talk about my definition. Since I, I came from the multi-model, multi-model domain. So mainly talking from video. So world model is like real-time interactive long horizon videos. So there are, there are three parts. Like, so we-- let's talk about them one by one. So the, so interaction, so we just, we just look at Flipbook and n-neural computer.

So the interaction part of it, so you-- a world model can allow you to interact with them through keyboard or mouse and maybe also voice. So th-these all of-- all of the modality, you can, you can interact with, with the model, and the model should respond reasonably. Second part is real-time. So once you, once, say, you, you move your mouse, like if, say, the, the world model generate a game, like how, how fast can the game respond?

So i-if you're like professional CS: GO players- ...my, they say, "Oh, you have to respond-

Swyx51:24

He's a gamer

Ethan He51:24

...in sub, sub ten milliseconds or, or even less."

Vibhu51:28

Yeah.

Ethan He51:28

So that's not... I guess most of the-

Swyx51:31

No, sixty FPS. Let's go .

Ethan He51:33

Oh, three hundred FPS .

Vibhu51:35

Oh, five hundred FPS.

Swyx51:37

Wait. Uh, okay. Yeah, I didn't do the math, but yeah, okay. Um.

Ethan He51:40

Yeah, three hundred FPS, that's a three millisecond. So you have to respond-

Swyx51:44

Oh, shit. Okay. Yeah, yeah, yeah

Ethan He51:45

...within three millisecond. Most of the video models cannot do that.

Vibhu51:49

Yeah.

Ethan He51:50

And, uh, but if you, say, if you have a video model that is, say like a digital human, the, the respond time might be more generous. Maybe like, typically, like for real-time voice interaction, it's like two hundred millisecond. So that's, uh, uh, that's much more generous. But even two hundred millisecond is pretty, uh, it is pretty tricky, 'cause like remember we mentioned you have this, uh, temporal compression coming from the VAE.

So if you, if you don't compress the temporal dimension, your sequence length is going to explode. So if you want to have this real-time, real-timeness in your model, you have to deal with long context problem. And the third part is, is long horizon, 'cause we, we're not going to just play with, uh, video games just, uh, like a, a few seconds.

Most video models, only a few seconds. We're going to play with minutes, hours. The model have to be able to generate long-form content. So putting these three together, it's, uh, real-time, long horizon interactive videos. I think the, the final state will be, for example, like a, a video, a video version of Flipbook, where you can, you can interact with a, a neural computer.

You, you move your mouse, and you, you click on the generative interface, and it will reply to you through, through pixels-

Vibhu53:27

Mm-hmm

Ethan He53:27

...generated in real-time. But getting there, it's, it's a very long way to get there. So one of the first step, uh, at Grok Imagine, where I led a small world model team there, was, was to build video extension. So, uh, video extension-

Vibhu53:47

Ah, it's the first step of interactivity.

Ethan He53:49

Yeah. It's, it's a first step. Yeah.

Vibhu53:50

Ah.

Ethan He53:50

So it's the first step-

Vibhu53:51

You have it here, video editing. Yeah. Yeah.

Ethan He53:54

Yeah. So the, the first step is because, uh, this unlocks long horizon videos. Typically, for most of the video generation models, you, you give it a prompt or an image as an initial frame. You generate video. That's it. That's just, uh, one time. Done. And some, some creators would try to like use the last frame as a first frame for the second video.

It can-- sometimes it works, but if you do it a few times, it, it's, uh, the quality would decrease. And-

Vibhu54:26

It doesn't have that context-

Ethan He54:28

Yeah

Vibhu54:28

...over the full video, so the temporal-

Ethan He54:30

Yeah, exactly.

Swyx54:30

Yeah, 'cause you only gave it the last frame, of course, right? Yeah.

Ethan He54:32

Exactly. And-

Vibhu54:34

It's actually a pretty fun hack. Like if you've seen like-

Swyx54:36

Oh, I know. He's saying something better.

Vibhu54:37

Yeah, yeah, yeah.

Ethan He54:38

And for example, like a View, uh, I remember View three has like a, a one-second context of the, the last video. It is slightly better than using the last frame, but it has the same problem, similar problem that it Uh, the quality would degree, like if you extend a few times to like one minute, the, the video quality would look much worse than the first video.

Second, an-another problem is as a model doesn't have long range knowledge of like what's happening before. So if they generate some dialogue, uh, some, two people speaking, and their voice might change, uh, over, over some time, especially if the one second conditioning, it does not cover the previous context. So these, these are the core challenges. So the Grok Imagine video extension, it has hi-historical context of all of the previous generated videos.

Vibhu55:42

Mm.

Ethan He55:43

It can-- Uh, it has, it has the context of, uh, who is speaking and what, what objects have appeared and everything, having that to generate the next video. So if we naively do this, you, you can imagine, like just, uh, put all of the previous history video tokens into the context. The context lens will easily explode. Especially for, for video models, that, that can be like a few, a few million context, I would imagine, context lens.

Yes.

Vibhu56:15

What's wrong with that?

Ethan He56:16

Yeah. For example, like in Cosmos, I think just five seconds of video is like a fifty, fifty K or a sixty K number of tokens. So like if you do, if you do fifty second, that's a five, five hundred K tokens. If you, you do longer than that, easily explode. This long horizon, uh, problem was the first step we were trying to solve world model.

It turns out people, yeah, people love video extension. Like a lot, a lot of the creators love using video extension to create longer form videos. This is the part I, I like that you have a, you have an intermediate step toward the, the final goal-

Vibhu57:00

Mm-hmm

Ethan He57:01

... instead of just a straight, straight shot to the, the final version very much.

Vibhu57:05

Yeah. But I can see you have a strong vision of where we want to end up.

Ethan He57:08

Yeah.

Vibhu57:09

Does it seem like it's an efficiency issue? Like, okay, we're at a few million tokens context, you know. If you draw the parallel to language models, we had very short context, two thousand, eight thousand, then, you know, you scale it up one million, ten million. Uh, sure, there's effective context, um, you know, but at the end of the day, it's just what's it worth?

Like, sure, there's a whole training data side. In video, it might be slightly easier because we have a hundred million token video, right? Just take a movie with the full context there. Like is this efficiency from an inference standpoint that like it's expensive, but we know how to solve it? Or like, why is this not the approach?

So like my broader point was on your second point of world models, you say it needs to be interactive and live, right? You should be able to play a game and see the interaction live. So one thing I see with research is a lot of what you actually serve is different than what you build, right? So we talked about distillation.

You train big model, you distill it, you do quantization, speculative decoding. We do all this stuff to serve it efficiently. Should we not just have a solution, like a world model that can interact well, do inference optimization, serve it, distill it secondary, so make it real-time after you solve it? So like a-another parallel is say, continual learning, right?

What we need is someone to solve it and show it works inefficiently. Give it a few years, people will make it efficient. Same thing with regular attention, right? It worked over a few years. People have different forms of attention, and we've scaled it to be efficient at long context, you know. So kind of two things there, right?

One is like, it seems like it works. You've scaled it. Can we not just scale it a lot more efficiently over time? Do we need a separate approach if this works? And same thing with interaction, right? Um, if we can get it done, like if we can solve some way that it works, you know, we can solve making it more efficient from an inference standpoint later.

Ethan He59:10

Yeah, that's actually a very good point. So in videos, there's actually a lot of redundancies. So we, we solve a lot of the pixel redundancy from VAE, but there's more redundancy in long, long range and long horizon videos. Say, if, if a character appear in the first clip and then it disappeared, it only reappear like, uh, at the end of the video, you probably don't need the, the context, like i-in the middle of the generation.

So you, you only need that character, uh, where, where you need. So that's why, uh, I helped build another feature. It's a reference video.

Vibhu59:53

Is it here?

Swyx59:53

Um, is it the same, same model release or different one?

Ethan He59:57

It's a different one.

Swyx59:58

Okay.

Ethan He59:58

You probably need to search on-

Swyx1:00:00

Okay

Ethan He1:00:00

... reference to video.

Swyx1:00:02

Okay.

Ethan He1:00:03

So reference video allow you to like upload up to seven images as condition and generate the video, say if like I want-- It can, it can be characters or objects or even scenes. Say like I, I want, I want condition Sean's selfie and holding, holding a blade or whatever.

Vibhu1:00:24

Yeah, we have a dog. We put the dog in the thing.

Ethan He1:00:26

Yeah. Yeah, you can put them there and the video models will generate the, the video from and copies the context over. So that can solve a lot of the problems there, like the, the long context problem. It doesn't need to have a very long context, but it's-- I feel like it's an intermediate solution. The model-

Vibhu1:00:45

It's cheating. Yeah.

Ethan He1:00:47

Yes, the model should be able to like selectively know, like where, where should I draw-

Vibhu1:00:53

Yeah

Ethan He1:00:53

... the references. So say if I want to generate a movie, I generate it autoregressively, like a ten second at a time or something. And now this character appear, I can look back to where it first appear and, uh, bring that back. Yeah, this one I, I put the references Yeah, that's, uh, Optimus, Einstein myself- Danny.

Swyx1:01:19

Oddly enough, I, I used Grok search to find it, and it pulled your LinkedIn post. But you know-

Ethan He1:01:25

Swyx1:01:25

... we found it.

Ethan He1:01:25

Oh, interesting.

Swyx1:01:27

But-

Ethan He1:01:27

Th- th- okay, this is a problem. Uh, this is not your fault, but, like, xAI doesn't communicate all this work that you do very well because they just have the model release and then that's it. But, like, actually these details are very, very good. Thanks.

Swyx1:01:39

As far as I understand, everything you just described is state-of-the-art. Like, like no one else has done it.

Ethan He1:01:44

Thanks.

Swyx1:01:44

Yeah.

Uh, a lot of-- yeah, I, I have a lot more-

Ethan He1:01:49

And like, and then, and then you just put this blog post with the cookies. I'm like, this is not enough, you know?

Swyx1:01:54

Yeah.

Ethan He1:01:54

Uh, but I, I obviously this is like the high level numbers that people wanna know. But okay, so, so-

Swyx1:01:59

And, and I wonder, you know, like part of that is also like, uh, some, some labs don't share research, research into what happens and if you-

Ethan He1:02:07

No, but this is literally bragging about how good they are, right? Like, why would you not say that you are capable of extending with full context?

Swyx1:02:15

Mm-hmm.

Ethan He1:02:15

You know, this is not a secret sauce. This is like we did the work. Like, yeah, I don't know. Yeah. I guess, uh, different labs have slightly different communication styles.

Swyx1:02:24

Yeah. Anyway, if anyone from xAI is listening- ... we, we are always h-happy to help you tell your story. Yeah. Okay. So you, you did references and I think, I, I think kind of the, the point here you're making is like, it is sort of like a kludge, right? Like this is... You can do seven, but what about one hundred?

Ethan He1:02:40

Yeah.

Swyx1:02:40

Right? Then you need a completely different thing.

Ethan He1:02:43

So I think it's like this is like a mechanism to like select the context from the history, and you might not put the entire history into the context. Uh, for example, there's a paper called Frame Pack, which have a heuristic that the latest history, like the last one second, I put the, the entire history and the history before that, I would, uh, compress it and makes the video smaller.

So they follow this pattern, this buildable pattern that the maximum sequence length is fixed. So the further you are from the current frame, you have a smaller image. So this is just a heuristic. I think it can be more automatic. The model is aware-

Swyx1:03:30

Yeah

Ethan He1:03:30

... like which, which history part of it can be select. So this part of the research is actually being actively, uh, worked on by a lot of people. It's also quite interesting. I feel this is actually, this part of long context is a little bit ahead of the LLM part.

Swyx1:03:48

Mm.

Ethan He1:03:48

So for example, again, in LLMs, if you-- so contexts keep growing. Let's say if you, you call tool and the tool call history is extremely long, that's still in context and, and keep growing, keep growing. Even if you switch the topic to something else, the whole con-context was there. There, there are some agentic harnesses that help you to, say, prune the tool results and, uh, prune, like when you, when you query a file, only show like the top two hundred lines or something.

But tho-those are very heuristic driven.

Swyx1:04:25

For listeners, we did a write-up on the cloud code, uh, leak-

Ethan He1:04:29

Mm-hmm

Swyx1:04:29

... where there are eight different kinds of pruning, uh, including like you prune the tool results and all that. So you can, you can read up on that kind of thing.

Ethan He1:04:34

Yeah. I think, uh, one breakthrough in continual learning might be like a way to automatically, uh-

Swyx1:04:43

Yeah

Ethan He1:04:43

... manage its own context.

Swyx1:04:44

These are all heuristics, and they will be replaced by machine learning.

Ethan He1:04:48

Yes. Interestingly-

Swyx1:04:49

The-

Ethan He1:04:49

... the same thing is being researched in both LLMs and video models

Swyx1:04:53

... but the interesting thing is also like in the paper you showed, it's actually happening at the model level, right? Compared to like language models, sure, we have base attention, but you know, we'll do our own compression, we'll do our own pruning, which is separate from model error.

Ethan He1:05:06

Yeah.

Swyx1:05:07

Eventually it all just boils in, hopefully.

Ethan He1:05:09

Yeah.

Swyx1:05:09

Yeah. Yeah, I, I think this is a form of like attention, uh, but like also know sort of reasoning attention. I, I-

Ethan He1:05:17

Mm-hmm

Swyx1:05:17

... feel like that's different than normal attention.

Ethan He1:05:20

Mm.

Swyx1:05:20

Does that, does that make sense?

Ethan He1:05:21

Yeah, yeah. It's, it's different in the sense that attention, not to mention, uh, set sparse attention aside, like normal attention-

Swyx1:05:30

Like UKV, yeah

Ethan He1:05:31

... you, you have to attend to all of the tokens.

Swyx1:05:33

Yes.

Ethan He1:05:34

So you, you don't have a high-level mechanism to, to drop which tokens do-- you don't want to attend to. As humans, humans' attention span is surprisingly small.

Swyx1:05:45

Yes.

Ethan He1:05:46

You, you can only remember eleven digit of a phone number.

Swyx1:05:49

But I have feature detection, right? I can detect, oh, that's a sequence of one, two, three, four in a phone number that is eleven digit.

Ethan He1:05:56

Yeah.

Swyx1:05:56

Very good pattern matchers.

Ethan He1:05:58

Yeah. But humans' context can, like attention can work because we can dynamically pull in, uh, context from, from different places. The same mechanism, uh, I think is going to happen for LLMs and video models. I think we have-

Swyx1:06:14

Yeah, or LLMs is recent-- is on, it's on the recent work-

Ethan He1:06:16

Yeah

Swyx1:06:17

... is there, which is not that, uh, crazy, but it's just recursive.

Ethan He1:06:21

I think it's somewhat inherent in models too, right? Like you-

Swyx1:06:23

Here's a nice example

Ethan He1:06:24

... you pull up these, you can read it fine, but, uh, language models are also very good at slop parsing. Uh, you know, you have a-

Swyx1:06:32

Yeah

Ethan He1:06:32

... trans-

Swyx1:06:32

I throw my typos in there, it doesn't matter.

Ethan He1:06:34

Yeah, yeah. You have a, you have a transcript, you have whatever, just throw it in and it's very good at parsing through noise. Um, m- you know, that may be a brute force. It can look over a reason over it, but like, you know, there's, there's parallels to both.

Inside xAI1:06:48

Swyx1:06:48

I think it's just really fascinating how you relate the world models stuff to the video generation, which I don't think a lot of people hear directly, uh, from people like, like you. So I think that's really helpful. Any other work? Do we cover like video, audio, uh, world models, uh, any other stuff in that omni team, I guess?

Ethan He1:07:06

Or any other work at xAI you wanna talk about? Seems like everything we see publicly announced, oh, cool, cookies, and then there's so much more to it.

Swyx1:07:15

There's a lot of depth.

Ethan He1:07:16

Any underrated stuff, you know, just at the time there?

Swyx1:07:20

Yeah. I feel the

Ethan He1:07:21

Is a culture, it is quite interesting and a bit underrated. So, so the culture is, the culture is a s- it's three sentences: move fast, build... no goal is too ambitious, and the first principle. Like, y-early the, the goal set w-was very ambitious. It wasn't very... It wasn't, it wasn't possible to, to achieve when, when I, when I was thinking, first thinking about it.

Like, for example, I could build, build something in, in three months and- Was that like, "Okay, we're starting team, we want image, we want video, do it by this deadline?" Or, you know, how do you work back? Like, was it just, "Okay, we have a rough by, you know, this date we want something out," or is this like- Yeah.

That's a very good point. So it's from first principle thinking. Mm. If you think about, people might say that first principle thinking applied more to the physical world than the, the models. Uh, I would say, for example, like if you think about some limitation, for example, acquiring data, like how, how fast can we acquire the videos? And if, if you think about, like training the models, like, uh, what's the iteration speed for training a model end-to-end?

And how, how would adding more GPUs accelerate that timeline? And maybe if you need human data, like what was the turnaround time for, for human data to arrive? If you put all of those together, that is first principle thinking where, oh, you know, like what is the timeline? What's the minimum number of days that is possible to achieve something?

I think that's a-- this is a lot of Elon's type of thinking, right? Yeah. He's like... I, I think he's famous for saying that the only law you can't break is the laws of physics, something like that. Yeah. Yeah. Just broadly, you, you worked a lot with Elon. Yeah. I, I guess one benefit is like, uh, w-working at xAI, you, you got a chance to interact more with Elon.

So, so I, I was very f-fortunate to get a few retweets from him. And that, that was quite fun. And, uh, he, he also worked very closely, uh, with, with people. Uh, like, like people imagine online, like he, he's very hands-on. There are two things. Um, one... So I was actually looking up, uh, Elon retweeting you. I'll pull it up.

Uh, he talked about you, you tweeting that you have a really good voice mode. I don't- Oh, me? No, no, no. Him, him, him. Oh, I also did it. But anyway. I actually... So I would DM you feedback on voice mode because I was like- ..."Wow, really good." And then I'm like, "Oh, this sucks." But, um- ...

I don't know. Anything you want to talk about, about your voice mode building it? Was it a team you worked on as well? Oh, that's, that's actually not part of the, the team I worked on. Okay. Yeah. He has probably worked on more of the- Okay ... video. No. Uh, but Grok Voice actually- Grok Voice ... like very good.

I... This is one of those things where like, uh, first of all, you can speak at 2X, which is fun. Yes. Uh, which I listen to 2X, so I like to speak at 2X. But also I think like the interruption was better than Gemini. Uh, I don't know how it compares to ChatGPT real time now, but like, you know, as far as like driving was concerned, like having Grok in my Tesla and like driving, I think it was like this really good experience.

Yeah. Yeah. He likes voice mode. But also, um- ... just the crazy reads by Elon too. Fifty million views for just saying, "Yes, true." Yes, true. Um- Oh my God. But, uh, you know, it's, it's pretty cool how fast it came out. I guess the other thing is the safety aspect of video mode. Anything interesting to talk about there?

So when- Ah, spicy. Spicy question. A lot of the countries where they, they don't allow like a ge-generative data-- generative AI, uh, videos without watermarks. So in all of the-- those countries, uh, Grok Imagine had watermarks and a lot of the-- a lot of the takedowns of, of the videos were also happening extremely fast. I mean, it's, it's part of running a social platform- Yeah ...

but also it, it transfers nicely to the GenAI side. Do you have a perspective on SynthID versus other kinds of watermarking? Yeah. I guess it's going to be... Yeah, it's going to be harder and harder to, to detect, uh, the... Yeah, these things. So SynthID, one thing is, uh, previously it was only Google and now, now, like a lot of different labs- Yeah.

OpenAI updated, yeah ... are also adapting it. As a, a limitation is like the, the technology... The s- the paper was out there and people can reverse engineer like how to get rid of it. Yes. And it's... I, I think even a-as it advance, it's, it's still, still possible to reverse engineer it. Yeah. Uh, so if you are interested, you can go onto Reddit and people have taken out the exact like, uh, I don't know, what do you call it?

Mask or- Yeah ... pattern that Google applies, and then you can apply it onto any Google-generated photo, and you can reverse out the SynthID. Yeah. And it's, it's also harder and harder to just judge by eyes. I remember like a couple of years ago, there's are like a six fingers or, or something. Yeah, yeah, yeah. It's very obvious.

My, my current is actually the audio. I feel like the audio is really lacking. Uh, my way to tell if something is AI-generated, outside of like, "Okay, I think I've seen enough, I have a decent eye," the audio matchup, especially of Sora, is not great. It's all similar style. But there is- I see. You know, those, those are minor imperfections.

Yeah. But I think the, the, the point is that like... Actually, my closest reference to this is, uh, also Ian Goodfellow, because I think he did like the adversarial GAN thing where like- Mm-hmm ... it's like, "Okay, here's a picture of a zebra." Then you like change one pixel and it becomes a panda.

Swyx1:13:29

Right? This is, this is like a classic computer vision issue.

Ethan He1:13:33

Yeah, if you think about how, how these models were, were trained, like I, uh, like I mentioned before, like GAN was in the training process. The objective of GAN is you-- is the model generates an image, and the model, there's a judge to tell, like, if the image is real or not. The model is trained to make the image more real.

So a-as the model become more and more advanced, it's going to be harder and harder. For me personally, now I have to judge by i-if the-- these videos have logical sense.

Swyx1:14:10

Mm.

Ethan He1:14:10

If these, these video-

Swyx1:14:12

Have a world model.

Ethan He1:14:13

Yeah, yeah, yeah.

Swyx1:14:15

No, I, I also like it-- the, the, the audio is too nice, like too stu-too studio quality. The lighting is too good. The skin is too clear. You know, the-- basically, the lack of imperfections.

Ethan He1:14:26

Yeah.

Video Agents1:14:27

Vibhu1:14:27

Do we have a good way to do reasoning in diffusion? Like, is that what separates video generators from world models? Or in-- you know, we really know how to apply it to-

Ethan He1:14:38

Mm-hmm

Vibhu1:14:38

... auto-regressive language models. Is there a parallel for diffusion VideoGen world models, like-

Swyx1:14:46

Yeah

Vibhu1:14:46

... on that point, right? Is-

Swyx1:14:47

He, he has a thing on video agents.

Ethan He1:14:48

Yeah, that's a good question. Yeah, actually, I have a, I have a pretty big claim. The, uh, the intelli-- the visual intelligence are actually mostly coming from language. Like, these video models, especially from now, since the diffusion model technology is more mature, the-- like, every time you see there, there are some improvement on these models, I would say mostly, the, the-- again, comes from language model, not, not coming from the, the vid-the video model itself, like the, the video distribution models themselves.

In Cosmos, like could be typically the, these models, they, they have two parts. Like there's a, there's a prompt rewriter or the prompt up sampler part. I think in, uh, in Cosmos, we use Llama or we use Mix, Mixtral, and the Cosmos video model itself is only 7B, and the model, the language model is a prompt rewriter.

It's, it's bigger than that. So the prompt rewriter's task is to take, take user instruction and convert it to extremely detailed description of the video. So because the video, the visual, the video distribution models, I would describe, they're, they're kinda dumb because they, they take the input instruction literally. Because in the training process, remember that we have to describe the video as, as detailed as possible when we are creating the, the synthetic, uh, text pair.

So this model, they take those kind of instruction to generate the videos. So in-- when you're taking the user instructions, the user instruction usually are simple, just say a cat or something. If you put a cat in, in the video model, they would take that instruction literally. They, they would literally show a cat, a cat in maybe a white background because you didn't describe the background.

The cat is not moving because you didn't describe it. It, it takes the instruction quite literally. It's kinda, it's kinda dumb. The, the prompt rewriter is actually a much bigger model. It's a language model that takes, uh, the user instruction and expand it. So the thinking process you mentioned, uh, is from there. So i-if you, if you look at like GPT image, like you generate a image in certain minutes, certain minute, it's, it's not all like a pixel generation.

A lot of time is spending-

Swyx1:17:36

Prompt writing

Ethan He1:17:36

... in thinking.

Swyx1:17:37

Mm.

Ethan He1:17:37

So, so prompt re-rewriting now have evolved to, like, not only just as thinking, it, it can, it can also be a, a agent, a agentic model. For example, say you want, you wanted to generate the image of today's news. So the-- so it's likely they'll go to fetch today's news online and then, like, process and digest them, then organize the layout and generate it.

Another thing quite interesting is, um-

Vibhu1:18:10

If I'm not mistaken, these are-- it's no longer a diffusion model though, right? It's auto-regressively... Or is there still?

Ethan He1:18:19

There are different approaches. For example, like, uh, Gemini Omni. Since they said-

Vibhu1:18:24

Yeah

Ethan He1:18:24

... it's Omni, I believe it's a, it's a single model. Maybe it's something like, uh, it's a language model with a diffusion head or something. Like the language model do the thinking, do the agentic tool calling, and then it would, uh, use the diffusion head to generate the image in the end. There, there were also approaches like Cosmos, where you, you have a separate language model and separate diffusion models.

And there, there were also like a purely language model, like you, you discretize the images, and then you generate the image as discrete tokens. So there are different approaches. I would say, like-

Vibhu1:19:01

One of, one of the claims I've seen for why these approaches struggle is because a lot of the benefits for how we currently learn reasoning with language models is you basically iteratively generate reason, you have your thought, and then you work on that answer, right? So if you have like Omni model and then diffusion head, you can't feed that back in to continue reasoning, right?

So you can't go like text image, text image. You can't reason on the output-

Ethan He1:19:27

Vibhu1:19:27

... and then go back to diffusion. But I guess in the new Gemini Omni, you would be able to, as long as you have

Ethan He1:19:32

Yeah, I'm not sure if they have that process. I guess it's definitely possible in the omni paradigm.

Swyx1:19:38

Yeah.

Ethan He1:19:39

So if you think about like traditional multi-modal language model, they would have a VIT encoder that can encode the image. So if they have a diffusion has, it can generate the image and then put that back into the VIT encoder, encode that, and then do, do the iterative refinement if the result... Yeah.

Swyx1:20:01

I think you have to jointly train the VIT and the diffusion to make that somewhat reasonable, because otherwise you're kind of like, uh, mismatching or feeding in slop .

Ethan He1:20:11

Yeah.

Vibhu1:20:12

I think it depends on the stage of training. You might be able to freeze it. But anyway, also just on your earlier-

Swyx1:20:18

Uh, I, I wanted to also make explicit, we do know that NanoBanana and GPT Image are autoregressive, uh, language model with diffusion head.

Ethan He1:20:25

Mm-hmm.

Swyx1:20:26

Uh, as far as I can tell from your description of Groq Image, it is not. It is, it is end-to-end.

Ethan He1:20:31

I, I cannot comment on that.

Swyx1:20:32

You cannot comment on that. Yeah. Well, the way that you described it. Uh, but yeah, I, I think there's, there's different approaches, right? Like-

Vibhu1:20:38

Yeah

Swyx1:20:38

... you started off saying prompt rewriter is like the, a big part of the intelligence.

Vibhu1:20:41

A-and even on that, I think everyone should try using an early diffusion model. If you've used Stable Diffusion 1 or whatever, if you've seen the prompts like, uh, you know, ultra-high res, 4K- ... this style, like, oh my God, the first time I tried one, you don't talk to them like language models, right? Your prompting is very, uh, you know, comma separated-

Swyx1:21:00

You're literally talking in the labels that were-

Vibhu1:21:02

Yeah. Yeah. Yeah

Swyx1:21:02

... in the data set, right?

Vibhu1:21:03

Yeah.

Swyx1:21:03

But, but basically, like I'm just trying to make the point that prompt re-writer and then image is different from autoregressive language model-

Vibhu1:21:10

Yeah

Swyx1:21:10

... with diffusion head. Right? They're different things.

Ethan He1:21:13

Yes, they're different. Yeah.

Swyx1:21:14

Yeah. Just wanted to establish.

Ethan He1:21:16

I say like the, the common part is like the, the image part. So, so it's, it's quite surprising that like a lot of the improvement came from the-

Swyx1:21:29

Language side

Ethan He1:21:29

... the thinking, the, the tool calling. So, so I still remember like in Cosmos, I, I generate a happy sheep and can... If, if without any rewriting, it's-- it looks so, uh, CGI, and after rewriting, it looks, it, it looks so beautiful. I, I think-

Swyx1:21:49

Without any joint training?

Ethan He1:21:51

Yeah, actually, without any joint training. Uh, it's-- with rewriting, it's already much better. See, a very interesting thing, I guess what happened is the video agents, mostly language models, will call these, uh, generative model, either it's a separate model or a diffusion head or whatever, as tool. So this model can iteratively refine the results or, or even like generate longer content through a very long train of thought.

It's actually very similar to how human create art. So, so we don't, we don't generate the pixels directly. We, we literally draw something on... And s- I think through this process, like u- these, these models not only use diffusion, uh, diffusion as one of the tool, it can also use traditional tool. It can also use, uh, image editing tools from Photoshop.

It can use, uh, video, video editor, FFMpeg, whatever, to c- take combination of these and the generative AI technology as a, as a set of tool, and they can, they can iteratively create a, a better, much better, uh, video for, for production grade quality. If you look at existing, uh, professional creators, they don't, they don't end at, uh, generating a video from, from, from these models.

They would take this video to, to their editor and edit here and there.

Swyx1:23:28

So much post-production-

Ethan He1:23:30

Yeah

Swyx1:23:31

... uh, in-- And sometimes actually, like the, the reason the video is good is not really the video model, it's actually the editing .

Ethan He1:23:38

Yeah.

Swyx1:23:38

And yes, we also are engaged in the same process as well. We'd love to use a video editing model.

Ethan He1:23:44

Yeah. Actually, there's, uh, Groq, Groq Imagine agent beta. That, that was the, that was the first attempt i-in that direction.

Swyx1:23:54

Yeah.

Ethan He1:23:55

So I, I think the, like the process would be similar to like-

Swyx1:24:01

It's just agent mode. Yeah.

Ethan He1:24:03

Yeah, you can, you can ask it to-

Swyx1:24:05

There's no blog post for it

Ethan He1:24:06

... maybe generate a, a one-minute, uh, video, like which, which is not possible i-if you ask the same prompt to video models. But this model will ca- literally call different tools to do that. So yeah, this is actually an interesting thing. So when, when they first released the, a video editing model, like, uh, I see on X some, some people try the video editing feature with like, "Edit this video to be one minute."

Uh, 'cause they didn't understand how video editing work. Video editing typically-

Swyx1:24:42

Yes

Ethan He1:24:42

... is just a removal, add, replace, style transfer, this kind of thing. But that's actually a valid request under the assumption of video agents. So these agents should be able to understand these kind of, uh, long horizon tasks to be able to easily, uh, create a long-form video. I think this is, uh, this is really fascinating 'cause it is kind of take-- it's taking the same, uh, same direction as first you have these, uh, uh, assisted-- uh, AI-assisted coding, kind of like tab completion, uh, GitHub Copilot.

And from there, you gradually evolve to Codex and Cloud Code, where you do things fully automated. So in, in agent, in Groq Imagine agent mode, you can, you can still go in there and do, do stuff by yourself. But gradually, as, as the model capability increase, it will be able to do everything fully automated.

Swyx1:25:47

Yeah. Um, I, I like that. Uh, okay-

Ethan He1:25:49

It's good

Swyx1:25:49

... so it looks like it's still generating.

Vibhu1:25:51

Also, I, I did notice the Groq ImageGen was always very, very fast. I don't know if this is something you guys benchmarked, but, like, this is just a tangent. Compared to-

Ethan He1:26:01

Mm-hmm

Vibhu1:26:01

... uh, when I used to use before the latest OpenAI's ImageGen, and same with Gemini NanoBanana, I would oftentimes use Groq-

Swyx1:26:11

Yeah

Vibhu1:26:11

... just for the speed.

Swyx1:26:11

It's, it's in the benchmark somewhere. There's, uh, in the Imagine API blog post that they have all the, the speed things.

Vibhu1:26:16

Yeah.

Swyx1:26:17

Uh, it mostly combination of distillation plus inference.

Ethan He1:26:21

Yeah. There, there are a bunch of things. Like, we, we talk about distillation, and if you talk about thinking, if you don't have any thinking budget, the model can just think three minutes and then come back to you. And also like inference, the inference infra team w- was very talented, and they were, they were able to accelerate a hell lot of these models.

Swyx1:26:44

Yeah. Yeah. I mean, you know, my comment on the, on the video agents things, like, I'm trying to figure out, like, when people say video agents, when you initially told me about v- your bet on video agents or your, your vision for video agents, I was a little bit disappointed. I was like, "Ah, you mean like models are tapped out, now we have to do agents?"

But, like, I think you have to, right? The question now is, how much model training is, is it really going to m- make a difference versus just building a better harness? Like you said-

Vibhu1:27:12

Swyx1:27:13

... uh, the models don't have to be jointly trained. Uh, you can just take an off-the-shelf frontier reasoning model, slap it on a harness, give it Groq as a tool. That's it. That's your video agent. Doesn't seem super satisfying. Obviously, you can co-train and, and get some more percentage points of per-performance. But, like, if your central claim that the majority of video or generative media, uh, alpha or whatever, is actually coming from language intelligence and not, um, image diffusion or video diffusion, then that is the future.

Vibhu1:27:47

I mean, it's pretty cool-

Swyx1:27:48

Primarily just weight

Vibhu1:27:50

... if you, if you-

Swyx1:27:50

Yeah

Vibhu1:27:50

... if you pop back at the example, you know, it, it generated frames. Sorry to interrupt, you know-

Swyx1:27:54

Yeah, yeah

Vibhu1:27:54

... it's been saying like, "Okay, I'm going to start stitching these frames together."

Swyx1:27:59

So-

Vibhu1:27:59

It's using FFmpeg-

Swyx1:27:59

Yeah

Vibhu1:28:00

... using code.

Swyx1:28:00

This is what GPT Image Pro as well is doing, right?

Vibhu1:28:03

Mm-hmm.

Swyx1:28:03

Like, this is also just writing code in the background and then just-

Vibhu1:28:05

Stitching

Swyx1:28:06

... doing an image pass on the final output. It feels dissatisfying for the people who want to just train models.

Vibhu1:28:11

It, it's interesting, right? Like- ... it's, it's also somewhat exciting. Like you brought up earlier, a lot of the gains don't come as much from the video. Like, I think you can see that in the language model space too, right? Anthropic, very, very good at coding. They're multimodal, not the best, right? They have basic input PDF, but like, you know, there's clearly a disconnect in the quality of their image video processing, audio processing, yet intelligence very top tier.

Other labs, Gemini, OpenAI, xAI, you can add modalities, but it's not like they're unlocking crazy capabilities, right? So it's interesting.

Ethan He1:28:49

Yeah. It's interesting to see that, like the, the video models' capability increase actually come from language model being more intelligent. I, I think video agent, like it, it can unlock more stuff than might... you might imagine. So there, there's a few things. So one thing is when we are prompting these models, so most of the people were actually not very good at prompting.

Swyx1:29:16

Mm.

Ethan He1:29:16

Actually, language models have a better sense of how to prompt-

Swyx1:29:20

Yeah

Ethan He1:29:20

... AI models. AI models know AI models better. So if you jointly train these models, maybe the, the model have a better sense of, uh, how to prompt each model. Like a different, different model-

Swyx1:29:32

Of course

Ethan He1:29:32

... might be different. Another thing is it might not as simple as just, uh, like generate a few clips and slap them together using FFmpeg. Like you might-- there might be more like image and video editing tool appear in this process. Say, if you want to exactly add, uh, add a blob of text at this timestamp, the videos model-- video models might not get that intention very precisely.

Swyx1:30:05

Mm.

Ethan He1:30:05

But these, these are possible using these deterministic tools. The long-- The, the video agents can use all, all sorts of tools, so you don't have to put all of the capabilities into the generation model itself.

Swyx1:30:21

Yeah. I, I think that's very true. Um, no, so, so for what it's worth, I think you're right. I think that this will be a big category. I think probably you are predicting like the next one year in video is going to be all this.

Vibhu1:30:35

Do you have a time, time prediction for how when this stuff ramps up? Like-

Swyx1:30:39

I mean, they already started.

Vibhu1:30:41

Is, you know-

Swyx1:30:42

It's not very good yet.

Vibhu1:30:42

Are we so... No, it's so, it's so good. I think the last one's just longer.

Ethan He1:30:46

Yeah.

Vibhu1:30:46

Uh, it didn't give me a minute.

Ethan He1:30:47

That's thirty-six.

Vibhu1:30:47

It gave me thirty-six seconds. But, you know, are we feeling it now? Is there going to be inflection? Is there any timeline predictions you want to make?

Ethan He1:30:54

I guess by the end of this year is, this is going to be a big hit. So the, the inflection point will be there and the, the videos generated by video agents can get to like production grade quality. Say it, it can be presented and it can be, uh, it can be distributed in, in ads. And one- once that happen, I think the enterprise will have much more budget for video models because the agents are i-inherently more expensive than, than the, than the video models themselves because they do this iterative process.

They, they generate many, many variations. Yeah, but once, once these models have this- Past this usability threshold, I think it's, it's going to be a exponential growth beyond that.

Swyx1:31:52

Yeah, I would, uh, fund a company right now based on this, uh, this thing. Um, so I think you're right. One thing I'm, I'm surprising, I'm reflecting on the whole, like, past, uh, hour or so of conversation. You-- I think you're into world models and video generation for video generation's sake. I think that a lot of other world models people, we've interviewed a lot of them, uh, general intuition and Fei-Fei Li and all, all those guys and, and Moondream, which I, I, I think I told you about.

Robotics1:31:52

Swyx1:32:17

Uh, Moonlake.

Ethan He1:32:18

Lake.

Swyx1:32:18

I keep saying Moondream. Goddammit. Moonlake. A lot of them actually say, like, robotics is the end game. Like, embodied robotics. Like, you want real-time, you want interactive, it is to interact with the physical world. You're not that concerned about it.

Ethan He1:32:32

I think robotics will be a, uh, will be a big part of it for sure. I guess the, the process may happen naturally. So, so my prediction on robotics is that the problem of physical AI might be solved, like without actually need to-

Swyx1:32:53

Be in the real world

Ethan He1:32:54

... need to be in the real world. So it might, it might get solved by a video-- A LLM is very strong video capability. So, so remember we talk about the, the real-time interactive long horizon video. Once these models-- So, so now these models are just training, training on, like, screen recordings and computer screens. Once these models can use computers and understand the future state of computer extremely well, the robots might be, might be one of the, one of the tools, like, a, a very powerful AI can, can use.

So the powerful AI might just, uh, be able to control the physical embodiment naturally.

Swyx1:33:45

I see that for sure. Cool. I, I know, I know we are coming up on time. Uh, you had-- You left one more spicy topic, which is why you left xAI.

Ethan He1:33:55

For me, um, there's, there's a lot of, a lot of research you want to do that you cannot do at, as a company, and also, like, the, the priorities and, and objective the-- of a company typically can, can change very fast. It is-- It's also the same for xAI. So, so now it's kind of like the, the time so there is some research I, I want to do, especially more on language model side-

Leaving xAI & Future1:33:55

Swyx1:34:25

Yeah

Ethan He1:34:25

... that I, I cannot do at xAI.

Swyx1:34:28

Oh, uh, okay, yeah. Just you're, you're basically leaving... You're-- You, you had this whole transition from computer vision to world models, video generation, uh, to now, now you're, like, focusing on LLMs.

Vibhu1:34:39

But it seems like, you know, a lot of you saying focusing on LLMs, you really, in the past hour, described how it all ties together, right? Like-

Ethan He1:34:46

Yeah.

Vibhu1:34:46

But I don't know. What, what do you mean by focusing on LLMs? Is there-

Ethan He1:34:50

I realized the fact that the, the video models, even, like, in the beginning, the game might come from improvement on diffusion technology, but this is a point where actually most of the game, uh, come from the language models themselves.

Swyx1:35:07

It's a huge black pill for anyone who has, like, spent their career in, like, generative, uh, media.

Vibhu1:35:13

I mean, uh, that's an extreme view, right? The-- You still definitely need a bit of both, right?

Ethan He1:35:18

Yeah.

Vibhu1:35:18

There's just, uh, it seems like more pressing, impactful work to do now on language model side.

Swyx1:35:24

Do you have any similar predictions? You-- So you predict the video agents, and I think you'll be right. Uh, on the language side, what are you looking for in the next one year?

Ethan He1:35:33

I think one thing pretty, pretty interesting I, I think might be happening soon is the, the language models will be, like, context-aware and manage its own context.

Swyx1:35:46

Yeah.

Ethan He1:35:46

So some-- Like, from, from the video model side, we, we've been suffering from the long horizon issue, like we want to generate video longer and longer, and we've been trying to solve the context length issues through various ways. One, one thing is just brute-forcing train longer context lengths. Another is to manage the context better. I, I think the same thing in language model is also going to be happening soon.

So for example, like the language models, they, they're not aware of how long their own context length is. Once they hit like eighty percent or something, the automatic context compression is getting triggered and, and the model, uh, is not aware of that when it's working. And some-- Maybe it's good for the models to, to know, "Oh, I'm, I'm approaching like eighty percent or something."

And so-something also pretty interesting, like for example, in OpenClau, like you-- every time you type in something, like, uh, a times-- uh, the current local time is automatically attached to your message, so the model actually know what time is it. So this is making the, the model time-aware. And also like in, in tool calling the, a lot of the intermediate tool call results automatically prune.

So there's like context removal, context addition, and, uh, context compaction. So all of these are from the harnesses themselves. But from our experience, the heuristic engineering all types of models get this absorbed into the models themselves. Uh, I guess that's something very interesting to explore.

Vibhu1:37:29

So infinite context?

Ethan He1:37:31

Maybe.

Vibhu1:37:33

No, but it's, it's interesting, right? Um, you-

Swyx1:37:34

It is in the space of memory and continual learning and-

Vibhu1:37:38

I don't know. It's also like in the space of agent harness use, right? You're saying-

Swyx1:37:42

No, he's, he's saying he doesn't want to do it in a harness, right?

Vibhu1:37:44

No, no, but models are also being trained on Unit-- using harnesses, right?

Ethan He1:37:49

Yeah.

Vibhu1:37:50

So some of it is, you could say, implicitly leaking in, right? Um, you know, part of that post-training of language models is okay, using it in coding harnesses, in which case, you know, when are sub-agents spawned? When is convection gonna happen? Uh, it-it's not explicit like, you know, you have this much token window, which I don't know if you want it to be, as that'll change, but it's, it's somewhat leaking in there.

Ethan He1:38:15

I mean, imagining what if the model have access to the whole-- the, the code of the agent harness itself and be able to modify it whatever it want. Say, if the agent harness is short enough, you can just put in the context lens in, in the system prompt, and then the model will say, "When, when I want to spawn a future version of myself, I can modify the agent harness."

For example, if I-- the agent harness can be, oh, when I'm reading a long document, I can choose to read the whole thing in chunks and, uh, come back, uh, smash the summary together. Or I, I can just read the first, uh, first two hundred lines and, uh, discard the rest and all, all kind of choices if they can be made by the models themselves.

It might be very interesting to, to see that the model can, like, uh, program-- the model can program itself online in test time.

Swyx1:39:19

Yeah. Uh, so the, the self-modifying harness is also part of, uh, OpenClaw and Pi, but I, I think there's a lot more work to do there. Very cool. I think part of me is kind of curious, you know, like I, I think you're a part of Big Lab, right? And there's this career path of a researcher at a Big Lab, which is you are-- you train models, you get more compute, you train better models, and you keep going.

And somewhat, I feel like you're opting out of that. And if I were you, I would be like, "Oh, I think this is, like, a bit of a career risk." You know what I mean?

Ethan He1:39:52

Mm-hmm. Mm-hmm.

Swyx1:39:53

I, I don't have any comment a-apart from, like, you're very strongly convicted. I think that a lot of people in your shoes would not be doing what you did.

Ethan He1:40:01

Yeah. Speaking of my career, if I, I look back, actually, there, there were, there were a lot of huge transitions. So, so ten years ago, I was, I was doing research with a ResNet authors, uh, Shang-Yu John and Chen Sun. Yeah, at that time, the, the research were completely different. It was like, uh, m-mostly confirmation, like image recognition, object detection, object tracking.

Uh, I w- I was also doing neural net compression at that time. It was quite different from knowledge dissolutions these days. And at that time, I, I was-- uh, I wanted to be a professor, and I, I applied. When I applied for a PhD, I already had a few first author papers at top conferences, so I, I confidently applied at the top schools.

It turns out I, I got rejected by all of the top PhD programs. So I had to, uh, I had to go to the industry. At that time, I was at Facebook AI Research fair, led by Yann LeCun.

Swyx1:41:08

I, I wanted to talk about VJPA, but it's different.

Ethan He1:41:10

Yeah. Anyways, yeah, we can leave it for another time. Uh, yeah, I switched to-- At that time, I switched to self-surprised learning. It was, it was quite different from what I was doing in contribution.

Swyx1:41:24

Yeah.

Ethan He1:41:24

And, and after that is NVIDIA Cosmos. So I realized scaling up was extremely important. So at NVIDIA, I, I was mainly focusing on scaling. So one thing is Cosmos scaling the video distribution models to, to a few billion parameters. And another thing is, uh, I was working on MoEs. The, the Megatron MoEs was the first, uh, was, was the first framework open source to be able to train these MoEs at very large scales, like, uh, hundred billions parameters to even trillions parameters efficiently at, like, forty percent MFU.

And going to-- switching to xAI was trying to work on even larger compute scale even further. And yeah, looking, looking at this trajectory, I actually worked, worked on a lot of different things. So I feel actually within, within ML, it's actually easier to switch than, than you think. Like, a, a lot of people might have mindset that, "Oh, I work on, I work on computer vision.

I always have to work on computer vision, and I cannot switch to language." And... But, but from my experience, at least at, at NVIDIA, I worked on both language model MoEs and also video models. It's, it's actually not the case. A lot of, a lot of the core principles how to train large models are, are largely the same.

And yeah, for, for me, I feel right now the, the bottleneck, uh, for, for video models is actually the language part, the, the agent, which w- which is why I want to go to work more on LLMs. One thing is it's, it's a bit of a challenge. I don't think it's a huge, uh, jump, so.

Swyx1:43:35

Yeah, I mean, kudos to you. I think you have a lot of, uh, strong vision there. Yeah, I think that was mostly everything that we wanted to cover. You've been very generous with your time, and I, I, you know, it's really nice that you are able to share all these things now. We don't have to go through xAI to, to clear everything.

Uh, but also we-

Vibhu1:43:52

Oh, you know-

Swyx1:43:52

I think we, we, we didn't get you in trouble.

Vibhu1:43:54

It's a lot of good stuff about xAI compared to what you just see in the releases, right? You don't realize how many more levels there are to it.

Swyx1:44:01

xAI, please do more podcasts. Uh, anyway.

Ethan He1:44:05

Thank you.

Swyx1:44:05

Yeah. But thank, thank you for, uh, sharing. It's been very kind. And, and also, like, I wanna hear more from you. I think you are going to embark on your, your next phase. You haven't announced what you're doing next, but clearly you have, you know, uh, more vision and more ambition on, on this path, and I think you're, you're basically kind of gradient descending to, like, whatever your final form is.

Ethan He1:44:25

Thank you. Yeah. Yeah, I'll, I'll share more a-about my next chapter soon.

Swyx1:44:30

Okay.

Ethan He1:44:31

Thank you for having me.

Swyx1:44:33

Thanks for coming.

Inside xAI: Building Grok Imagine in 3 Months, Videogen vs World Models, and Video Agents— Ethan He

Topics

Mentioned

Transcript

Intro0:00

From Cosmos to Grok1:22

Building Video Models11:41

The Cost of Training33:47

Faster Inference38:21

Audio-Video Generation42:42

World Models49:51

Inside xAI1:06:48

Video Agents1:14:27

Robotics1:31:52

Leaving xAI & Future1:33:55