LALatent SpaceMay 25, 2026· 29:59

⚡️ Google's Open AI Strategy — Omar Sanseviero, Google DeepMind

Omar Sanseviero, head of Developer Experience at Google DeepMind, breaks down Gemma 4's novel architecture with per-layer embeddings that enable parameter offloading, allowing a 2B active parameter model to run fast on devices. He explains trade-offs between dense and MoE models, notes fine-tuning is declining as base models improve, and highlights Gemma 4's native multimodal support for audio, images, and short video. The team is growing in Singapore and India, and Kaggle's recent integration will help benchmark agent capabilities.

  1. 0:00Gemma 4
  2. 3:14Launch
  3. 4:29Offline vs API
  4. 6:26Multimodal
  5. 8:08Multilingual
  6. 9:30DevRel at AIE
  7. 10:42Text Diffusion
  8. 13:37Fine-Tuning
  9. 16:29Sparse vs Dense
  10. 20:09Gemma Scope
  11. 23:59Auto-Research
  12. 26:06Team Expansion

Transcript

Gemma 40:00

Host0:04

We got so much. Gemma 4, Gemma 3 1, Gemma Scope, Met Gemma.

Omar Sanseviero0:08

Mm-hmm.

Host0:09

Give us the TLDR.

Omar Sanseviero0:10

Yeah, so yeah, Gemma 4 is just out. It's the, uh, most, uh, capable open model we've released so far, where we try to compact as much intelligence per parameter as we could, bring all of these multi-model capabilities. So yeah, uh, that's Gemma 4.

Host0:23

So one interesting thing, you have this thing with effective parameters-

Omar Sanseviero0:27

Yeah

Host0:27

... not active parameters. Uh, can you explain what it is?

Omar Sanseviero0:31

Yeah. So pretty much in the traditional transformer architecture, you have, like, this big embedding layer, right? Uh, and this new architecture is, uh, is more of a small change in the transformer architecture in the transformer block. Pretty much we add a per-layer embedding, so at every layer we add an embedding table. What is exciting is that you don't need to do, like, the full matrix multiplication.

This is pretty much a lookup table. So the Gemma 4 model is, uh, E to B. That means that it effectively has two billion parameters loaded into the GPU. Uh, it actually has almost five billion parameters, but those three billion parameters can be in the CPU, they can be in the disk, which means that you can do inference extremely quickly.

This is just a lookup table.

Host1:09

And what's the con? Why don't we al- why don't we always do this? Can it scale? Is it open research? Like, you know, it seems very, "Okay, if I can just offload half the parameters to save yours."

Omar Sanseviero1:19

Yeah. Yeah, so pretty much, uh, here we did lots of quality experimentation, and this is really optimized and designed for, like, on-device... Uh, and when I say on-device, I mean like running in a phone, Android, a Raspberry Pi, and so on, right? Uh, when you go larger, you usually want to compact more, uh... You want to have more, like, dense architectures or MoEs.

Uh, so this, this research, these research decisions were very helpful for these small, uh, small use cases.

Host 21:44

Yeah, something I learned from the run that you organized this morning-

Omar Sanseviero1:47

Yeah.

Host 21:47

Uh, for, for listeners, um, I think this is the first ever, like, official run club-

Omar Sanseviero1:52

Yeah, yeah

Host 21:53

... at AIE. Uh, 6:30 AM. Rough. Very rough, but, uh, at least I woke up for it. Uh, I met Cormac.

Omar Sanseviero1:59

Yeah.

Host 21:59

And he was telling me that, uh, apparently in China, the super apps are shipping models in the app bundle for inference and just, like, use among all their super app constituents.

Omar Sanseviero2:10

Yeah.

Host 22:10

And I don't know. Is, is, is that, like, a target use case for you guys?

Omar Sanseviero2:13

Yeah. So actually, if you install... Like, like if you buy a Pixel phone or a high-end Samsung, they come fr- with a Gemini Nano, and Gemini Nano is baked into the operating system. And Gemini Nano is really built on top of Gemma. So last year we released Gemma 3N, which was this architecture really designed for phone use cases.

And they use a Gemma 3N with some additional training, some additional adaptations, to make the model good for, like, traditional, uh, on-device use cases, right? Uh, so pretty much when you buy, like, these high-end phones, you can already use a Gemini, uh, out of the box.

Host 22:44

Yeah, we actually covered the 3N paper in our paper club, and this, like, idea of, like, sort of parameter offloading-

Omar Sanseviero2:50

Yeah

Host 22:50

... or, like, download on demand is, like, very cool. Is it exactly the same in the Gemma 4 stuff?

Omar Sanseviero2:56

Yeah.

Host 22:56

Okay.

Omar Sanseviero2:57

For the smaller models.

Host 22:57

Yeah.

Omar Sanseviero2:58

Yeah.

Host 22:58

Yeah.

Host2:58

And does it, does it scale? Is there potential G- So for reference, Gemma 4 is a 29B and a 31B, one's MoE, one's dense. But-

Omar Sanseviero3:08

Yeah

Host3:08

... have you scaled it? Have you pushed it up? Is it...

Omar Sanseviero3:11

We are doing lots of experiments.

Host3:12

Experiments? Okay.

Omar Sanseviero3:12

Yeah, yeah. Stay tuned. Yeah.

Launch3:14

Host 23:15

What goes into shipping a, a mainline model like this? Like-

Omar Sanseviero3:19

Yeah

Host 23:19

... what, what's the behind-the-scenes?

Omar Sanseviero3:20

It's complex. The Gemma team is actually relatively small. We have, like, uh, two or three PMs, we have one marketing person, and then the rest are, like, engineers and researchers working on shipping this. Uh, of course, there's, like, the full training part. We... How do we do the post-training, distillation, post-training techniques, and so on. What is quite exciting is that once we have the model, then we collaborate with a bunch of open source partners, right?

So for example, we work with Llama CPP, Ollama, MLX, Hugging Face, vLLM, NVIDIA, AMD. So we have, uh, almost 50 external partners for every launch... Well, for the Gemma 4 launch, which has been the most complex launch. And also internally, we collaborate with a bunch of different teams. So think of Google Cloud, Vertex, Vertex Model, Models as a Service, ADK, uh, and then Android as well, right?

So we work, for example, with the Android team. And, uh, with the launch of Gemma 4, we released an integration with Android Studio. So in Android Studio there is this agent mode where you can have a, a model helping you write code and do things within Android Studio. And they ship this, uh, integration with offline models using Llama CPP or vLLM or any OpenAI-compatible endpoint.

So now you can use Gemma 4 to also write code, uh, Android applications in Android Studio.

Offline vs API4:29

Host4:30

Where's the difference? When would someone wanna do that versus just-

Omar Sanseviero4:33

Gemini

Host4:34

... use Gemini?

Omar Sanseviero4:35

Yeah, yeah. Of course.

Host4:35

Outside of the obvious you're offline, uh, or you want the privacy-

Host 24:39

You, you fly planes a lot or something.

Host4:41

I did... Okay, I will say, on my long 10-hour flight to London, I did use Gemini as my-

Host 24:46

Yeah, I, I was on Gemma 4, though.

Host4:47

Sorry, Gemma. Gemma.

Host 24:48

Yeah.

Omar Sanseviero4:48

Yeah, yeah. It's m- mostly offline use cases, right? Uh, or if you... Yeah. Offline or privacy, like if you want to have all of your development set up locally and you don't want to send any code to, to any API, you would use that.

Host4:59

Do you see a future where, you know, small models get good enough? Like, does it cannibalize? It's an interesting position. Like, you have big Gemini, you have Gemma. Both get exponentially better over time. Like, current Gemma is much better than what we had open source a few years ago.

Omar Sanseviero5:15

Yeah. Yeah, for me it's quite exciting. I mean, if you look at Gemma, you compare to how we were one year ago, I would say Gemma, uh, 4 is matching state of the art from one, one and a half years ago for most things. With local models or models that you can run in your own hardware, you can get capabilities.

So you can get agentic ski- agentic capabilities, function calling, system instructions, like conversational and that kind of stuff. Knowledge is much trickier. So for knowledge you would need a larger model, right? That's why if you compare Gemini to Gemma, Gemini, uh, has much better knowledge understanding of the world, right? Like, uh, facts, information, and so on. So it really depends.

I do think we are heading towards a future in one, two years where, imagine, like, you can run a Gemini 3 Pro powerful model direct in your phone, right? And I think once we get there, things will be quite exciting, uh, from a product integration, from which experiences we can, uh, enable the users. Um- I wouldn't say it cannibalizes.

It's still, like, two very different things. Like, if you want, like, flagship capabilities, like these super complex, long-running tasks, you would use Gemini if you need factuality and so on. But I do think for many of these agentic things, we'll get to a point in which we can do very powerful things directly, uh, on device.

Multimodal6:26

Host 26:27

Yeah. Can we talk about the multimodality-

Omar Sanseviero6:29

Mm-hmm

Host 26:29

... uh, sides? Any advances there that you wanna highlight or you've been getting good feedback on?

Omar Sanseviero6:35

Yeah. So Gemma 4 was built on the same research as Gemini 3, uh, which pretty much means that we benefited from all of the improvements that happened with Gemini 3. Uh, multimodal-wise, uh, the smaller models can understand audio, images, and short videos, so 30 to 60 second videos and, and audios. Uh, for out-

Host 26:52

Which is actually quite long.

Host6:53

And even the on device, like the 2B-

Omar Sanseviero6:56

Yeah

Host6:56

... 4B2B can do-

Omar Sanseviero6:58

Yeah, yeah

Host6:58

... very good multimodality.

Omar Sanseviero6:59

Yeah. For audio we have a speech recognition. We have a speech to translate the text, and then a bit of a speech understanding. So you can do, like, a... Ask questions about an audio file and so on. So use cases that are very optimized for, like, on device phone use cases. Uh, and then on the vision side we also improve things quite a bit.

So we have object detection, pointing, captioning. Uh, we do not have image segmentation, which we know is, like, one thing that many people have been asking us. But otherwise, like, for many things, uh, we do support that. The other thing we do not support yet is video with audio. So we can u- understand, like, video input or audio input separately, but if you want to pass, like, in the same prompt both the visual part and the audio part, we still need to do some improvements around that.

Host 27:43

And that's just a matter of, like, more data or-

Omar Sanseviero7:45

Probably some additional fine-tuning could yield some very good baseline model for this.

Host 27:50

Yeah. Yeah. What about audio out?

Omar Sanseviero7:52

We are exploring some things here.

Host 27:54

Yeah.

Omar Sanseviero7:54

Uh, nothing I can share at the moment. Yeah.

Host 27:56

I think e- everyone's excited about the... Like, when do you have native speech-to-speech, right?

Omar Sanseviero8:02

Yeah.

Host 28:02

But as far as I see, people always get excited, and then the pipelines always win.

Omar Sanseviero8:07

Yeah. Yeah, yeah. Gemma is quite important for us, the multilingual aspect as well.

Multilingual8:08

Host 28:12

Ah, yes, yes.

Omar Sanseviero8:12

So Gemma supports 640 languages, uh-

Host8:16

You did a lot of work on the multilingual encoder, the tokenizer, right?

Omar Sanseviero8:20

The tokenizer, right. Right. So-

Host8:21

For adding.

Omar Sanseviero8:22

Yeah, exactly. So the tokenizer has been pretty much based on the Gemini tokenizer. It's extremely good. So independently of the Gemma capabilities, if you just pick base Gemma model and you fine-tune for an additional language, uh, it actually works extremely well. Uh-

Host 28:36

What are some... Uh, sorry, I didn't read that part. What, what are some insights on the tokenization?

Omar Sanseviero8:39

Uh, this comes from Gemma 3. Like, this has been done already for over a year, but the tokenizer is pretty much the same as, as Gemini, which means that the tokenizer, uh, lends itself to capture the right tokens for different languages. It's like a very good multilingual tokenizer. Which means that if you compare Gemma, uh, 3, so I'm going to the previous generation.

If you compare Gemma 3 to other models from back then, maybe the other models were better than Gemma 3, like, as general, uh, model. But if you train all of these models, uh, for, I don't know, uh, a specific Southeast Asian language, I don't know, Vietnamese, let's say, uh, Gemma would yield better results even if the o- other base models were potentially better.

Host 29:17

Yeah. I mean, I, I, I think there is some limit at which you basically have platonic representation, right? Like, you understand the core concept and it translates to whatever language you want.

Omar Sanseviero9:28

Yes.

Host 29:29

I guess, you know, you are also... You, you have purview over all of, uh, sort of Google developer experience, and you brought the team here for the first time.

DevRel at AIE9:30

Omar Sanseviero9:37

Yeah.

Host 29:38

What was that like?

Omar Sanseviero9:38

It's quite exciting, to be honest. Uh, we have already participated in previous AIE Europe conference. Like, Philip or, like, other team members have been in some of these in the past. This is, uh, London. This is DeepMind's home. Uh-

Host 29:50

We have to.

Omar Sanseviero9:51

Yeah, we have to. I mean, we brought- ... a, a bunch of researchers from the team to share about different things that we're working on. We've brought other teams, uh, from Google, not just from DeepMind, that are also, like, using AI in one way or another. So we brought people that, that are doing on-device machine learning, people that are doing lighter TS or optimizations to run models directly in phones or in the browser.

We brought people from the Android team. We brought people that are working all over Google, from robotics to research to Android. Uh, so yeah, it's quite exciting to come here and really show all of the things that the, the company's building. Uh, not just come and share, like, the things that our team is doing, but really all of the, uh, overarching AI, uh, story that we're-

Host 210:29

Yeah. I think you are... I mean, it is the lab with the biggest scope.

Omar Sanseviero10:32

Yeah.

Host 210:32

Right? You do, do everything, including dolphins. Uh, and it's very impressive. Like, yeah, so, so you brought Sander.

Omar Sanseviero10:39

Yeah.

Host 210:39

Uh, we- would you talk, talk a little bit about the researchers that you brought.

Text Diffusion10:42

Omar Sanseviero10:42

We brought researchers in a couple of different topics.

Host 210:44

Yeah.

Omar Sanseviero10:44

So we, we brought one of the researchers that worked, uh, in the Gemma development, in the development of Gemma 4. We brought a researcher that works in diffusion models as well, uh, for, uh, diffusion transformer models. So, uh, diffusion as-

Host 210:56

Text generation

Omar Sanseviero10:57

... text generation.

Host 210:57

Yes.

Omar Sanseviero10:57

Not, not image, uh, generation.

Host 210:58

Which was announced but not released.

Omar Sanseviero11:01

Exactly. We did the Gemini diffusion last year-

Host 211:04

Yeah

Omar Sanseviero11:04

... at IO. Uh, which is very cool because, uh, you can, uh, generate code extremely quickly, right? Like, it's, uh, yeah, it's stupidly-

Host 211:10

Yeah. So the main pitch is speed.

Omar Sanseviero11:12

Yeah.

Host 211:12

But other than speed, is there, like, a secondary, you know, what can we do with a diffusion model that we cannot do with autoregressive, you know?

Omar Sanseviero11:18

It's mostly speed.

Host 211:19

Okay.

Omar Sanseviero11:20

Yeah.

Host 211:20

I, I feel like in terms of code structure, there may be some things where you're like, "Okay, I want the brackets here."

Omar Sanseviero11:26

Yeah.

Host 211:26

And then you fill in the blanks, right?

Omar Sanseviero11:27

Yeah.

Host 211:27

So fill in the middle-

Omar Sanseviero11:28

Yeah

Host 211:28

... is, like, a common code problem, but this is extended fill in the middle or, like, extended, like, "Oh, help me upscale or put a LoRA."

Omar Sanseviero11:36

Yeah.

Host 211:36

I don't know. You know, translate the image analogy-

Omar Sanseviero11:39

Yeah

Host 211:39

... to text.

Omar Sanseviero11:40

Yeah, I think in the past fill in the middle was, like, this task that many companies were trying to tackle as an additional generation task, and now people are just assuming that the model can do fill in the middle with a general-

Host 211:52

It's more autoregressive.

Omar Sanseviero11:53

Yeah. Exactly.

Host 211:54

Yeah, yeah. No, no tricks about special tokenization-

Omar Sanseviero11:57

Yeah

Host 211:57

... or anything like that.

Omar Sanseviero11:58

Exactly.

Host11:58

It used to be a, you know, mass language modeling. You, you're trained to predict fill in the middle.

Omar Sanseviero12:03

Yeah.

Host 212:04

You had to rearrange your data set as well in order to, to do FIM.

Omar Sanseviero12:07

Yeah, it was a bit tricky. People were always getting, like, the, the prompting, the, the, the tokens wrong, and yeah. If you deviated in any way from the training format, it didn't yield-

Host12:16

Yeah

Omar Sanseviero12:17

... good results. Now we have, like, very good out-of-the-box capabilities for that.

Host12:20

What's the in- what is the idea about investing in text diffusion? Is there a world in which this overtakes autoregressive?

Omar Sanseviero12:27

Uh, yeah, that's a good question. I think at the moment it's still very experimental. Uh-

Host12:30

Yeah

Omar Sanseviero12:30

... I think we'll be releasing and sharing a bit more research of the things that we have been doing around diffusion, uh, generate- uh, text generation models on that space. I would say it's still very early stage.

Host12:40

Yeah.

Omar Sanseviero12:40

I think, uh, especially, like, the model quality is still a bit worse from what you would get from the, a normal autoregressive model.

Host12:46

Yeah. A lot of what you were mentioning earlier about it for, you know, okay, fill in this code, lock this stuff, it seems different to how we're building agents these days of, you know, sequential tool calling, this, that.

Omar Sanseviero12:56

Yeah.

Host12:57

Uh, I guess it's... If it's just speed, it's speed.

Omar Sanseviero12:59

Yeah.

Host12:59

If it's an RNC, but it's just-

Omar Sanseviero13:00

I could see, I could see a world where there's, like, system one, system two. System one is the diffusion one. System two is autoregressive. System one is the, the planner.

Host13:08

Yeah.

Omar Sanseviero13:08

System two is the executor. I don't know.

Host13:11

Yeah, could be.

Omar Sanseviero13:11

Maybe, uh, it's, it's, it's too hypothetical at this point, I think.

Host13:14

Yeah.

Omar Sanseviero13:15

You know? But I will say-

Host13:16

The diffusion-

Omar Sanseviero13:16

Yeah

Host13:16

... diffusion transformer models are difficult to fine-tune as well. Uh, so-

Omar Sanseviero13:20

Yeah

Host13:20

... so there's also, like, a point in which, uh, uh, how much flexib- like, yeah. I, I could see a world in which, yeah, you have, like, a very strong, uh, agent manager kind of a setup, and then you have, like, executors, like diffusion-based executors that, uh, do, like, a specific coding. Are people fine-tuning outside of... You know, we see a few big companies do...

Fine-Tuning13:37

Host13:40

Okay, like, Cursor has a-

Omar Sanseviero13:41

Yeah

Host13:41

... really good consistent model. There's a few that have done fine-tuning, but it seems like it's not picking up as, you know.

Omar Sanseviero13:48

Yeah, so there was this period, 2024, I think, which there was, like, this... Maybe 2023. Like, there were all of these fine-tuning communities, and I think it's been changing quite a bit over the last two years because models are getting very good out of the box. So as I was saying, like, for Gemma 4 we had 50 to 60 partners.

Uh, and some of them were like, "Oh yeah, we're going to try and fine-tune, uh, the 27B model for this vision task." And they, and they were like, "Oh, actually, uh, the model works too well out of the box. We don't need to fine-tune it."

Host14:18

Yeah.

Omar Sanseviero14:18

Yeah, we saw lot, lots of those things. So I'm seeing this excitement around fine-tuning nowadays as general conversational models.

Host14:26

Yeah.

Omar Sanseviero14:26

There is still quite a bit of excitement around fine-tuning for specific domains like finance, uh, healthcare, specific types of data that the model didn't see. But as general conversational, like, just changing how the model behaves, you can do most, most of that via prompting nowadays, and in terms of capabilities, the models are very good out of the box.

So it's been changing quite a bit. There is still, like, the onslaught people. I don't know if you know, uh, Daniel Khan and his brother and Michael.

Host14:51

Every year I give them a three-hour workshop to just talk about-

Omar Sanseviero14:53

Yeah, yeah. They- ... they, they are the GOATs. Uh, they still do, like, amazing tools for the community to fine-tune, and the community use those tools. But I'm seeing, like, some changes in the trends. I think, uh, people are not fine-tuning that much anymore.

Host15:04

And you guys put out a version of your own. Med-Gemma is a fine-tuning of Gemma 4.

Omar Sanseviero15:08

Yeah, yeah. So Med- Med-Gemma, uh, the last Med-Gemma, which we released three months ago, Med-Gemma 1.5, it's, uh, based on Gemma 3.

Host15:15

Gemma 3.

Omar Sanseviero15:15

Yeah, Gemma 3. So it's pretty much Gemma 3 and then additional training with some of our medical data sets.

Host15:20

Yeah. How do you see, uh... If I'm not mistaken, Apple foundation models on device were a bunch of LoRAs for different tasks.

Omar Sanseviero15:26

Yeah.

Host15:27

And when you're constrained or running on device small efficient models-

Omar Sanseviero15:30

Yeah

Host15:30

... uh, you guys did a offload, so you're, like, caring about efficiency.

Omar Sanseviero15:33

Yeah.

Host15:34

Um, but, you know, do you see a world of multi LoRAs for tasks? Should people be fine-tuning the small one?

Omar Sanseviero15:41

I think this is a big challenge in general in the whole developer ecosystem because let's say that you want to have 20 apps in your phone, right? And let's say that each of those apps comes with its own LoRA, right? What happens when you update the model, the base model? You also need to update all of these LoRAs.

So from a developer point of view, I think it will be very tricky because, one, you don't want to have 20 different base models in the phone of the users. The battery will just die. Uh, you also don't want to have to update 20 LoRAs every time you update the base model, right? So the release cycles in the Android world are, uh, and in the iOS world are very different.

So yeah, I think it's more of a general industry challenge that, uh, we need to f- figure out, uh, how we think that people should build ML, like, on device, uh, phone, uh, power, like, AI experiences.

Host16:25

Yeah.

Omar Sanseviero16:25

It's, it's more of a product and developer experience kind of challenge.

Host16:29

Yeah. I have a question about the bigger Gemma models.

Sparse vs Dense16:29

Omar Sanseviero16:31

Yep.

Host16:32

So you have two models that are-

Omar Sanseviero16:33

Yep

Host16:33

... pretty similar size. One is dense.

Omar Sanseviero16:35

Yeah.

Host16:36

One is MoE.

Omar Sanseviero16:36

Yep.

Host16:37

Uh, can you talk a bit about, okay, say you have a 27B you're putting out.

Omar Sanseviero16:42

Yeah.

Host16:42

Uh, how do you think about should I build an MoE? Outside of inference and using it-

Omar Sanseviero16:46

Yeah

Host16:46

... how do you think about when to do MoE versus dense?

Omar Sanseviero16:49

Yeah.

Host16:49

What are the trade-offs?

Omar Sanseviero16:50

Of course it's inference. Uh, what, what else is there?

Host16:52

Yeah, but then there's two at the same size. Pretty much.

Omar Sanseviero16:54

No, yeah, but I mean, one is 31B, which is dense, right? And that's like the most raw intelligence, and then you have the 27B with, uh, 4 billion activated parameters.

Host17:02

Right. But, like, you know, why not a 31B, 5B active, for example?

Omar Sanseviero17:05

Yeah, I mean, we-

Host17:06

You can just fit more in dense?

Omar Sanseviero17:07

Yeah, I mean, we did quite a bit of experimentation and, like, research on, like, which would be the best sizes that would be friendly to developers, and we chose... We, we made decisions around that, right? Uh, uh, the 31B is really, like, the largest model size that a quantized would fit in a consumer GPU. The 27B is more, like, an extremely fast inference, uh, within those constraints.

MoEs are challenging to fine-tune. Uh, I, I don't know if, uh, we've talked about that in the past, but MoEs in general are, like, extremely good. They architecture, they work great for inference. But when people fine-tune them, they struggle a bit. Like, they are not as easy to fine-tune for instruction following. The standard recipes and hyperparameters that you have may not work out of the box for MoEs.

Host17:47

The intuition is the, the routing kills the backprop or what?

Omar Sanseviero17:50

I, I, I think so. Uh, I, I, I don't have a very strong intuition on it either-

Host17:54

Yeah

Omar Sanseviero17:55

... to be honest. Uh-

Host17:56

People always say this, but I'm-

Omar Sanseviero17:57

Yeah

Host17:57

... I'm trying to say why, right? 'Cause if you can train it, you can fine-tune it. Like... Fine-tuning is just training.

Omar Sanseviero18:03

Yeah, I th- I think it's a mix of the routing and, yeah, just having, like, different distributions, and the distribution may affect the routing in a different way than a, a dense model, which we just change the things. Uh, that's kind of my intuition, but also, like, I think there are many different variables here, like how many, uh, experts do you trigger or, uh- Yeah.

They, they are like a bunch of different parameters that you can move, like whether you freeze or you don't freeze the-

Host 218:27

Yeah

Omar Sanseviero18:28

... the router, like a bunch of things that you need to, to think about.

Host 218:30

Yeah. To me, the most important, uh, asymptotes that I'm looking for are what is the minimum sparsity level that we-

Omar Sanseviero18:39

Yeah

Host 218:39

... can reach, and then what is the most, let's call it Elo per byte.

Omar Sanseviero18:44

Yeah. No, yeah, that- that's the thing that we discussed quite a bit, like what's, uh, the intelligence per parameter, right? Like, how do we maximize this intelligence per parameter? Because-

Host 218:52

There has to be a number, right? Then we can stop, right?

Omar Sanseviero18:53

Yeah, and because if you compare like the tw- I mean, Gemma, we have done the same size, right? 27, like almost 30 billion, around 30 billion parameters for Gemma 2, 3, and 4.

Host 219:01

Yeah.

Omar Sanseviero19:01

And the intelligence is much higher, right? Like we have now increased the model size.

Host 219:05

Yeah, it's like that, that, that.

Omar Sanseviero19:06

Yeah.

Host 219:06

Yeah.

Host19:06

It was an easier number when everything was dense. Then you have to add in sparsity. Now you have offloading.

Omar Sanseviero19:11

Yeah.

Host19:12

So.

Omar Sanseviero19:12

Yeah, you cannot compare like a MoE to dense models. There's, there is no... There are some like, uh, napkin calculations you can do to compare, but it's not apples to apples. But that's a good question. Like, I don't know like where we'll be in three years from now. I would assume like a 30 BP, uh, model par- uh, parameter model could be extremely powerful.

I still think there are limitations in terms of knowledge, so maybe the model will be able to do like-

Host 219:34

Yeah, it's just-

Omar Sanseviero19:35

... super wild agentic stuff, but it will not know like who was the president of X country 25... I mean, maybe, yes, but like very niche knowledge probably the model will not have.

Host 219:44

Yeah. Um, it- there's just- this is just information theory, right?

Omar Sanseviero19:46

Yeah.

Host 219:46

Like you're using the model as a database.

Omar Sanseviero19:48

Yeah.

Host 219:48

So of, of course there's gonna be limits. The other thing is also I always think about, uh, when, when we talk about this topic, superposition, right? Anthropic has this whole concept of superposition where you can store information in the smaller weights as- because it compounds with the other weights as well.

Omar Sanseviero20:02

Yeah.

Host 220:03

And so, um, not that much research on it since then, but, uh, maybe this is my segue into McEnturf, uh, uh-

Host20:09

Gemma Scope?

Gemma Scope20:09

Omar Sanseviero20:11

Yeah, so last year in December, we released Gemma Scope. So Gemma Scope pretty much allows you to, uh, analyze the, the activations across different layers based on the tokens, uh, input.

Host 220:20

Yeah, it's fantastic.

Omar Sanseviero20:21

And yeah, the team released, I don't know if it was couple of terabytes, maybe even up to like one petabyte of, uh, data that we had to store because we did that for every single layer across all of the Gemma 3 models, so it's a very complete-

Host 220:33

And Llama as well?

Omar Sanseviero20:34

We did it just for Gemma 3.

Host 220:36

Oh, okay. I-

Omar Sanseviero20:36

Yeah

Host 220:37

... I think Neuronpedia had some others.

Omar Sanseviero20:39

Could be. Could be.

Host20:40

There's a few other teams, uh-

Host 220:41

I was like, wow

Host20:42

... Illya was doing.

Host 220:42

Yeah, it was, it was very, very cool, the cross lab, uh-

Omar Sanseviero20:45

Yeah

Host 220:45

... partnership.

Omar Sanseviero20:45

Yeah, yeah. There are a couple of open source tools there as well that you can just do- create your own, uh, yeah, your own activation, uh, uh, networks. Uh, yeah.

Host 220:55

Yeah, yeah.

Omar Sanseviero20:56

It's a niche field. I think it's a, it's a good opportunity. I think we were talking about this earlier, right? Like it's an area where you don't need lots of compute to get started. That allows you to understand like how the model works. You can experiment. You can get a bit of a sense of how, yeah, how transformer architectures work.

Host 221:11

Yeah.

Omar Sanseviero21:11

So it's a good area.

Host 221:12

Okay. The context of this is really like why bring researchers to AI engineer, which is an engineering applied AI conference. Uh, one, to me, uh, it is actually very important that you bring the researchers because engineers want to learn about how the models w- that they use were trained, even if they never, ever trained it themselves.

Omar Sanseviero21:27

Mm-hmm.

Host 221:27

Right? Because I think they, they just feel more trusting of the model-

Omar Sanseviero21:31

Yeah

Host 221:31

... if they, if you peel back the curtains a little bit. And also, uh, I think there's some prestige, that people want to feel like they can go home and talk about it intelligently, even if they- ... they don't actually, you know, know how to train it. The other thing is like, I, I do think that research and engineering are closer than people think.

Omar Sanseviero21:47

Yeah. Totally.

Host 221:48

Uh, there's, I mean, there's research engineers.

Omar Sanseviero21:50

Yeah.

Host 221:50

And McEnturf is probably the easiest, single easiest way that engineers can get into research if they want to.

Omar Sanseviero21:55

Yeah, I think in, I mean, in big part, like so many researchers are doing ablations, right? Like they are just-

Host 222:00

Yeah

Omar Sanseviero22:00

... moving the pieces around and seeing what works and what doesn't work. Uh, of course, there's like a branch within research that is more- much more like architectural design and like, um, much deeper, but there's lots of very like empirical experimentation and seeing what works, what doesn't work, uh, moving things around, uh, which for me is, is mos- more engineering, uh, rather than-

Host 222:19

It is, yeah

Omar Sanseviero22:20

... for like research unless we are like creating new activation functions maybe. But-

Host 222:24

Yeah. I think this maybe is a change in your career as well. Like it used to be a, like a joke like, haha, our researchers are terrible at coding. And then they throw it across the wall to some engineer that will, that will clean up the code.

Omar Sanseviero22:35

Yeah.

Host 222:35

But now everyone has their own personal research engineer, right?

Omar Sanseviero22:38

Yeah. And something that is cool that is happening is also how researchers begin to adopt some of the cool agentic tools now. So for example-

Host 222:46

Yeah

Omar Sanseviero22:46

... within the team we are building skills to do experiments and ablations and evaluations, and how the research team can use all of these agentic tools as part of their research process is also quite interesting.

Host 222:56

Yeah, yeah. I had Yitae, uh, on my podcast who led the post-training for IMO, the IMO Gold's, uh, model. I think it was Deep Think. Um, and he was notably, he was an AI researcher that doesn't use AI-

Omar Sanseviero23:06

Yeah

Host 223:07

... until this year.

Host23:08

It's, it's gone even further. People making novel math research, like some of the Erdos problems-

Host 223:13

Yeah, yeah, yeah

Host23:13

... they are engineers, not researchers, with no background in math just, you know, using coding agent-

Host 223:19

Mong- mong the math, guys.

Host23:20

Yeah.

Host 223:21

I mean-

Host23:21

Just not math, not research, but you know, solving some of the most, you know-

Host 223:26

Unsolved problems

Host23:26

... unsolved problems, yeah.

Omar Sanseviero23:28

Yeah. But e- even in the model architecture side of things, like two years ago when all of these people started to fine-tune models and to do experiments and do model merging, there was quite a bit of research that was happening in GitHub and in Reddit and in local Llama, and people were actually like inventing new things, and then there were papers-

Host 223:44

Yeah, Franken merges, uh, yarn.

Omar Sanseviero23:45

Yeah, like all of the Frank- Franken MoE, uh, stuff, like all of the Axolotl library, like all of these tools, and there were papers published by different companies and research labs one or two years later that were rediscovering what was already done by the Reddit or Discord people without, yeah, anyone noticing.

Auto-Research23:59

Host 224:01

Yeah, yeah. Do you have a take on auto-research? Every AI wave has a auto ML wave.

Omar Sanseviero24:05

Yeah.

Host 224:06

And this is the auto ML wave of this wave.

Omar Sanseviero24:08

Always been a bit skeptic. I mean, auto ML few years ago was mostly like just a-

Host 224:12

Parameter search

Omar Sanseviero24:13

... search. Yeah, yeah. Pretty much like research in, in this higher, yeah, parameter space. Uh, I don't know, like with Carpathia experiments it's been quite interesting to see- ... like, uh, how things are evolving. I w- I don't know what's your take on this.

Host 224:24

Things are just cooler when he does it. Uh, I do think some, some part of this is you're just speed running experiments agentically, right? The agent-

Omar Sanseviero24:32

Yeah

Host 224:32

... the coding agent is more autonomous. You can actually go to sleep. And it will do the things that you would've done anyway, so you're just kind of automating things that you would've done.

Host24:39

I see it differently.

Host 224:40

Yeah.

Host24:40

I think, uh, okay, it will be a very exciting time if we have a move 37 from an auto-research.

Host 224:46

Yeah, yeah.

Host24:46

If you make an impactful discovery that someone wouldn't have thought of, right? So there's the side of, okay, I have these ideas, go run them in the background, that's fine.

Host 224:54

Yeah.

Host24:54

But the, the interesting side is actually when you're shooting off not just paths that you wouldn't have thought about, but you know, trajectories that people wouldn't think about and they work and you make new discoveries.

Host 225:05

Yeah.

Host25:05

That, that's the very exciting thing. I think when you have more approach to just token spend and send off, you know, hopefully that becomes possible.

Omar Sanseviero25:14

Yeah, I do think the next generation of fine tuners will not be l- I mean, will be people that are not coding at all, right? Like, uh, one year ago we had to write, like, our own code, uh, with transformers or Unsloth or, or whichever library of your choice. I do think as we, like, keep evolving, like most people will be fine-tuning with a couple skills, right?

Like Hugging Face has the skills, like all of these libraries have skills. They will just, uh, prompt their agent to kick off like some experiments and see what works, what doesn't work. And we-

Host25:39

And honestly, it's a, it's a good middle ground right now. Like all the tools you've mentioned, they let you fine-tune in minutes.

Omar Sanseviero25:45

Yeah.

Host25:45

You don't need to know what's happening under the hood at all.

Omar Sanseviero25:47

Yeah. So I think that's where, like the direction will be, like people that just want to do fine-tunes to improve the capabilities for certain domain or like add some like new behavior, they will not be coding the fine-tuning code. But of course, if you want to do like deeper research in the architecture, my hunch is that most likely, uh, this will not be like a automatable, at least in the next one or two years.

Host 226:06

Okay. We gotta wrap up soon.

Team Expansion26:06

Omar Sanseviero26:07

Yep.

Host 226:07

Uh, I just wanted to end a little bit on your, your, the growth in your team. Um, and you know, Paige is here, uh, Logan is, is o- over in SF, uh, and you've been hiring all my friends. Uh, Thor and Ivan-

Omar Sanseviero26:21

Yeah

Host 226:21

... and all these. Um, what does the team look like? Where are you looking to grow?

Omar Sanseviero26:25

It's been quite exciting. We are hiring lots of h- very high agency people. Yeah, I think maybe three, four years ago we, we did a, like a nice interview about how, uh, I was growing like, uh, DevX, uh, at Hugging Face and-

Host 226:37

Yes

Omar Sanseviero26:37

... how we were thinking like DevRel should look like. DevRel, I think mainly is also interesting is redefining what DevRel should be in an AI, very AI-centric organization at the frontier. It's our research lab at the end of the day, and we are in this AI era. So it's also rethinking what DevRel should look like, uh, in 2026.

We are, yeah, we're hiring pretty much like high agency people, excited, uh, to, to build things, to engage with the community and so on. Right now we are growing in Singapore, so we are looking to hire someone in Singapore and also-

Host 227:05

Coming to AI Singapore.

Host27:06

They're coming.

Omar Sanseviero27:07

Yeah. And to hire someone in India. So those are like two locations which-

Host 227:09

Why is Singapore so important?

Omar Sanseviero27:11

So Singapore is interesting. Singapore has a relatively small but very high, like very dense, uh, high talent community. Now we have a proper DeepMind office as well. Like it's small, but, uh-

Host 227:22

Right

Omar Sanseviero27:22

... it's growing quite quickly as well. Uh-

Host 227:25

Mainly because Yitae doesn't want to move.

Omar Sanseviero27:28

Yeah, yeah. So-

Host 227:28

But it's a huge win for Singapore. We don't have research in Singapore usually.

Omar Sanseviero27:32

Yeah. So we're trying to grow the team in places co-located to like people doing like traditional DeepMind-y research-y activities. We don't want like, uh, to have like-

Host 227:41

Sales office

Omar Sanseviero27:42

... people that is in a single town that is not connected to anyone in person from, from DeepMind. So ideally if they go to the office, they can talk with researchers doing like their own... Even if it's a different project, they can be part of like the more DeepMind-y side of things. So, so yeah. So we have-

Host 227:56

This is good

Omar Sanseviero27:56

... like people in, in Paris, in London, in, I mean, Zurich, in SF, New York, so all of these, uh, DeepMind hubs, and Singapore now is becoming like a very small but very exciting hub as well.

Host 228:06

Good.

Host28:06

This is all the DevRel, DevX team in DeepMind, right?

Omar Sanseviero28:09

Yeah.

Host28:09

DeepMind has also expanded a lot here. Like-

Omar Sanseviero28:12

Yeah

Host28:12

... a few weeks ago, Kaggle joined DeepMind.

Omar Sanseviero28:14

Yeah.

Host28:14

How's, how's the org in general shape?

Omar Sanseviero28:16

DeepMind in the past didn't do that much product and yeah, now we have like a AI Studio-

Host28:21

Even DevRel

Omar Sanseviero28:21

... the Gemini API, now Kaggle. But Kaggle is, is part of the team. Actually Kaggle is also here.

Host28:25

Yes.

Omar Sanseviero28:25

There are 50 members here talking about the-

Host28:27

Very excited.

Omar Sanseviero28:28

Yeah. Talking about evaluations.

Host28:29

Yeah.

Omar Sanseviero28:30

Uh, they, last week they released a, a new system for agent evaluation. It's like a very, like experimental initial benchmark, but pretty much allowing agents to take an exam and compete in a leaderboard, which is always fun. Yeah. When, with, with Kaggle joining us, I think there are a couple of exciting things. There's a whole Kaggle community hackathon things that enables the community to build hackathons, but there is also the Kaggle benchmarks, and I think Kaggle benchmarks can connect very well with how we think about Gemini and the capabilities.

And if you're in the eval space, like, you know, like many benchmarks can be benchmarked. Uh, many people are gaming the benchmarks, and we want to identify like which are these capabilities that maybe we are not aware that we have or that maybe we could improve and bring all of that feedback from the com- benchmarks created by the community in an organic way and bring the, all of that feedback back to, to the model itself.

Host29:18

Yeah.

Omar Sanseviero29:18

I mean, the way we are doing Gemma, Gemini, and all of our tools is really like based on the feedback from the startups, the community, the developers. So that's why you see like Logan, Paige, everyone in the team talking with the community in social media, in forums, in events, and really understanding what people are building, uh, with our tools and bringing all of that feedback to, to the modeling teams, uh, which is very cool as being part of DeepMind.

Host 229:42

Yeah. Yeah. Well, you guys are doing amazing work. Thank you so much for joining us here and, uh-

Omar Sanseviero29:46

Yeah. Thank you

Host 229:47

... can't wait to see what's next.

Omar Sanseviero29:48

Yeah. Thank you for having us here.

Host29:49

Yeah.