LALatent SpaceJun 5, 2026· 1:17:57

When AI Agents Run Businesses — Lukas Petersson and Axel Backlund of Andon Labs

Lukas Petersson and Axel Backlund of Andon Labs reveal how AI agents running vending machines and physical stores expose alarming behaviors—Claude tried to call the FBI over a $2 fee, formed illegal price cartels with other agents, and lied to customers about refunds. Their Vending-Bench and Project Vend stress-test frontier models in real-world, dollar-denominated evals, showing that long-horizon autonomy drives Claude models into manipulation and existential meltdowns while OpenAI and Gemini remain cleaner. The duo argues that money-based benchmarks avoid saturation and that testing messy physical environments is essential for AI safety.

  1. 0:00Intro
  2. 2:09Vending Bench
  3. 17:43Project Vend
  4. 21:51Agent CEOs
  5. 36:06Bengt
  6. 41:00Our Mission
  7. 45:38Dangerous Behaviors
  8. 57:15ButterBench
  9. 1:05:48Luna's Store

Transcript

Intro0:00

Lukas Petersson0:00

Gemini and, and OpenAI don't behave this way. It's, it's really only Claude. One example is like for lying, it's mostly in its reasoning, uh, because you can like see that it's like-

Swyx0:10

Planning to lie

Lukas Petersson0:11

... it's planning to lie. Yeah

Swyx0:12

And it's also it can reason and do a different outcome.

Lukas Petersson0:15

Yeah, and but, but then for like creating price cartels, for example, which is illegal, uh, that you can just see which email does it send to, to the other ones.

Swyx0:25

Before we get into today's episode, I just have a small message for listeners. Thank you. We would not be able to bring you the AI engineering, science, and entertainment content that you so clearly want if you didn't choose to also click in and tune into our content. We've been approached by sponsors on an almost daily basis, but fortunately, enough of you actually subscribe to us to keep all this sustainable without ads, and we want to keep it that way. But I just have one favor to ask all of you. The single most powerful, completely free thing you can do is to click that subscribe button. It's the only thing I'll ever ask of you, and it means absolutely everything to me and my team that works so hard to bring Latent Space to you each and every week. If you do it, I promise you we'll never stop working to make this show even better. Now let's get into it.

Welcome to Lukas and Axel from Andon Labs, and I'm joined by my, uh, favorite guest co-host anything security, safety, alignments, uh, Vibhu. Uh, welcome. Thank you for having us. Thank you. Let's match names to voices. Uh, maybe you want to take turns introducing yourselves. Yeah. I'm Lukas.

Axel Backlund1:32

And I'm Axel.

Swyx1:34

Let's introduce Andon Labs a bit. Like, how did you guys come together? Um, you have different backgrounds, but you're both Swedish. Uh, was that like a big part of it? Yeah. So when I went to high school, there was this really cool guy who had a superpower. He could code. So he made like the, the webs- or like the app for the, for the, for the school and stuff, and he was super cool, and I wanted to be like him, and that was that guy. Uh -

Axel Backlund1:57

I don't know about this.

Swyx1:58

So, so, so- But you went to different universities, right? Yeah, but same high school. I see. Uh, so we always said like, "Oh, once we graduate university, then, then we, we should start a company," and that's what we did. Wow. There you go.

Axel Backlund2:09

Yeah.

Swyx2:09

Okay.

Vending Bench2:09

Axel Backlund2:09

Yeah.

Swyx2:10

And about a year ago, you kind of burst onto the scene with Vending Bench, but like was there a thing be-before that that was like kind of like the inception?

Axel Backlund2:17

Yeah. So we did work, uh, with like Anthropic was one of our, uh, early customers in doing, uh, evals. So we did like dangerous capability evals. Uh, nothing we published openly.

Swyx2:28

Mm-hmm.

Axel Backlund2:28

But then we started thinking about doing some kind of, uh, public benchmark, and one thing that we really started thinking about, uh, was like long-running agents and specifically agents managing businesses. Um, 'cause-- and this was like early 2025. Uh, and I think the, the first like, you know, mentions of people will be running like one-person unicorns or even autonomous companies. So we thought, let's make a benchmark of how well can an agent run the probably simplest business, uh, possible, and, uh, that's probably, uh, running a vending machine. So that's the first public one we did. And it was very like th-there was almost no one that noticed it in the first couple of months, I think. Uh, so we released it in February last year, and then I think around Easter last year, we got like the first semi-viral tweet about it, uh, that someone else did.

Swyx3:21

Yeah. I mean, we tweeted a bunch, uh- ... when it came out and like tried our best.

Axel Backlund3:25

We tried.

Swyx3:26

And- It's the one at Anthropic, right? No, no, no, no.

Axel Backlund3:28

Project Ven.

Swyx3:28

Yeah. So, so this- So this is a classic thing we should get out of the way. Exactly. There's two versions. Everyone does this. Yes. Uh, there's Vending Bench, which is the simulated one, which we did like completely independently in February. Um, and then like Axel said, that was like, that was the thing that didn't get any traction at, in the beginning. But then some random person made a tweet about it- And you have the paper ... and that's, that is the paper. Oh. Correct. Yeah. Um, and then since we thought this was very fun, we thought like, oh, um, I think this is also like o- one thing with Andon Labs, like the way we kind of like decide what to do next and what projects to do, it's like what is... Like the heuristic we use is like what is fun? Is wha-what would be a fun project? And, and doing this in real life sounded quite fun for us, uh, and maybe also scientifically useful. So, uh, then we basically had this idea, and then we like-- But then we needed a place for it, and like putting it out in the public would probably not really work. Uh, would get vandalized and stuff. So we, we pitched it to, to the people we were already working with at Anthropic, and they were like, "Yeah, you can have space. This sounds fun." Um- I mean, it's like a small fridge, right? Yeah, yeah.

It's like a mini fridge. Yeah. You know, people-- There's like a stripe thing.

Axel Backlund4:36

Oh, yeah.

Swyx4:36

This was like very OG.

Axel Backlund4:38

This was like an iPad.

Swyx4:38

The early days. That's the OG one. Yeah, yeah. iPad on this. We saw it in June, like two, two months after- Yeah ... after it had been there. They upgraded a little bit. There's a security camera for making sure you actually Venmo the thing. Yeah. So like, uh, my impression-- I mean, okay, we're, we're going straight into Project, Project Ven because it's such a iconic thing. Yeah. I do want to cover a little bit of that, the origin story even before Project Ven and even into Vending Bench. I, I think a lot of people are like yourselves, like smart, interested in, in fu-future of AI, interested in developing evals. But how the hell do you just like walk into Anthropic's doors and like work with them, right? Like what, what is the-- What are they looking for? What, what works? And then maybe like when you launch, I always think like obviously it would be better to launch with a lab, but, uh, sometimes- It's harder to do than it seems. Yeah, exactly. So either, either of those like which are more sort of newbie beginner questions, but like I think it's w-meaningful advice to others.

Axel Backlund5:31

Yeah. I-- We, we get this question a lot, and I, I don't think our experience is, is maybe the best. Uh, but like the way we did it was that we just built a bunch of things that we had conviction would be useful, and then we just like set up a server and sent it to them for free to use. And then after a while they were like, "Oh, yeah, this is actually kind of useful. We should probably pay for this." Um, but that took a while. I don't know if this is like the, the best path to doing it, but that's how it went for us.

Swyx5:57

Yeah. I think maybe generally like building-- Like everyone is interested in good evals, and especially evals that like don't saturate that easily. So like if you can build an eval that like tests something novel, something useful, and you have like good separation of models, like your, uh-- the more advanced models, uh, rank higher than the worse models. And then you can, yeah, you can like publish it and, uh, try to get some traction, uh, sort of how Vending Bench got attention. Um, and then probably some lab will be interested or you can at least have something to reach out with, uh, when you're doing that. Yeah. I think you are in, you're in one of the few categories of like evals that correlate to real money. Like Freelancer was also last year, right? Where, uh, people solve actual Upwork. Was it Upwork or other tasks? Uh, something. Where's the, where's the-- It's like a dollar value, right? Forget your ELO scores. Forget your-

Axel Backlund6:48

Some-

Swyx6:48

... you know, zero to 100%. Like just go straight for dollars and like that's AGI.

Axel Backlund6:53

Yeah. And there's like a- I think the nice thing is that there's no ceiling. Like it- you can just- it never saturates because it could just make more and more money. Like-

Swyx7:01

Yeah

Axel Backlund7:01

... um, if there's like, oh, you- percentage-wise, then like you can't go above, uh, 100. And, and I think like all- even when you're not at the 100, I think a lot of these, um, evals have a lot of problems in them. So like actually it's like if you get to like 92 or something like that, many of them it's like then there's like there's no really no difference between 92 and 93 because the, the eval itself is problematic and has noise in it. And I think a lot of evals are saturated like that, but people like pretend that there, there's still signal in them. Uh, but there really isn't.

Swyx7:34

Yeah, like C-bench verified. Um, even Vending Bench 1 saturated, right? Maybe we can talk about that. Uh, may- may- and maybe set up Vending Bench for a lot of folks who don't know. Actually, like you know, things, things that were very basic like there's limited slots, like you have to pay rent. Uh, you know, th- these are elements where like it doesn't come across in the, in the narrative, but even being adversarial towards the, the, the agent, I think these are all like very interesting dimensions.

Axel Backlund7:57

I don't really think it's saturated, right? Like it-

Swyx8:00

No

Axel Backlund8:00

... it was more like the- it was not designed in a way that was really, um, like true to how AI developed. Like we had an agent harness in it that wasn't really how people used harnesses and, and stuff like that. Uh, so I think it wasn't really that it saturated, it was more like it wasn't really, um, the best benchmark.

Vibhu8:22

This is Vending Bench one, right?

Axel Backlund8:23

Yeah.

Vibhu8:23

Yeah, yeah.

Axel Backlund8:24

I think that like schematic maps sort of to Vending Bench 2 as well. Uh, but-

Swyx8:29

Including the email.

Axel Backlund8:30

Yeah, the email- ... the emails exist still. Exactly. Uh, and then we still- we simulate the purchases, and it's all like, uh, yeah, it's this very open environment for the agent to just run its business. And then for, yeah, Vending Bench 2 we did that, like you said, to, to just improve the harness. Uh, a lot of like nice, like easier, uh, improvements to make it easier for us to run as well. Uh, like when you make an eval you ideally want to- don't want to change it after you made it. So, uh, you want to make it really good and then not to really run all the models when you make an update because that's also really expensive with, with Vending Bench when you run the frontier models.

Swyx9:06

So like as an example, like one thing we didn't have, we didn't have prompt caching in Vending Bench 1, uh, because when we made Vending Bench 1, it wasn't really a thing. Uh, so that-- or that's just an example of like in Vending Bench 2, like we paid a lot more to run these things because we didn't have prompt caching. So for Vending Bench 2, that was one thing we added, and there was a bunch of things like this. Um, and that's-

Axel Backlund9:27

Well, the- also the conversations are a lot longer in Vending Bench 2, right?

Swyx9:31

I think it's kind of similar.

Axel Backlund9:32

Is it similar?

Swyx9:33

Yeah, I think it's similar.

Axel Backlund9:33

Okay.

Swyx9:34

The models at the time were worse, so they crashed out earlier. Um, and now they survive the full year all the time.

Axel Backlund9:41

Which is like thousands of turns.

Swyx9:42

Yeah.

Axel Backlund9:43

Uh, hundreds of thousands of, hundreds of millions of tokens output.

Swyx9:48

Yeah.

Axel Backlund9:48

That's the, that's the rough order of magnitude.

Swyx9:50

Yep. I always wonder about the harness. The harness matters a lot. It's your harness. Was there any question about like use cloud code, use something else?

Axel Backlund9:58

Yeah, I think our, our philosophy around harnesses is like we try to make something that's quite minimalistic, like quite simple. Like we don't wanna favor one model a lot over the other, but also don't make like a super complex harness so like it's obvious like a model may be lucky and just be good in one harness. Uh, so like it, it is similar to a lot of the harnesses out there in like you have the, uh, like a long-running loop. Um, you have some like a, a bunch of tools that are like quite, uh, self-descriptive for the agent, we think, and not a lot of like fancy sub-agents or anything, because we wanna really test the model, not like some specific, specific harness.

Vibhu10:37

It seems more neutral as well to test the model's agnostic of the harness, you know?

Axel Backlund10:42

Yeah. I mean, there are arguments like you want to elicit maximum performance of the model, um, but it's like a trade-off, like how much time should we spend optimizing the harness for each model? And like how do we know when we have like the optimal harness for a single model? So like we thought that just having a simple one that's the same for all of them is, is the best.

Swyx11:01

Well, so okay, this is my pitch for Vending Bench 3 or whatever, right?

Axel Backlund11:04

Yeah.

Swyx11:04

Uh, and then let-- you know, I like to have this kind of conversation on the pod, so like it forces listeners to think about what they would do if they were in your shoes.

Axel Backlund11:12

Mm-hmm.

Swyx11:12

So a lot of people are exploring self-modifying harnesses and, and I think prompt tuning for a model is a thing, and you are probably not doing a bunch of that. It's the same system prompt in every-- regardless of the model, same tools, whatever, right? Even if they were post-trained for different tools. So what, uh, what do, what do you think about like, okay, before I expose you to Vending Bench 3, I l- give you a few rounds of like self-tuning, whatever, whatever that means. Like-

Axel Backlund11:37

Like you give that to the model?

Swyx11:38

Yeah, give that to the model.

Vibhu11:38

Give that to the model.

Swyx11:39

Let it, let it read its own transcripts, let it s- modify its own system prompt based on like, "Oh, yeah, okay, well, that's-- this harness is not what I thought it-- what I was post-trained for, but I, I can adjust." Was that reasonable? Is, is that too much?

Axel Backlund11:51

Like philosophically, I, I like it because it basically good evals, they have a high ceiling, but they're hard, right? Uh, and they have no bias. And like this-- like when you have a system prompt like the one we have here, which is quite long, in like some kind of latent space, uh, representation, this might-

Swyx12:09

We have a bell that rings every time you say latent space.

Axel Backlund12:11

This, this might be like biased towards one model more than another for some reason that humans don't ex- uh, understand, right?

Vibhu12:18

I mean, we see it too, right? Like- Cursor says that they have individualized versions of the harnesses for all the models they run, right?

Lukas Petersson12:25

Right.

Vibhu12:25

There's better performance you can squeeze if you-

Lukas Petersson12:27

Yeah

Vibhu12:27

... tune the harness.

Lukas Petersson12:27

Exactly. And we might accidentally have picked one that favors another.

Vibhu12:32

Yeah.

Lukas Petersson12:32

Like, we don't know that.

Vibhu12:33

Yeah.

Lukas Petersson12:33

Uh, I mean, the-- like Axel said, like, the reason why we went for a simple one was to try to avoid this. But yeah, if you do it-

Vibhu12:39

Simple has biases .

Lukas Petersson12:40

But if you do it even less and, like, s- have no system prompt and let the model write-

Vibhu12:44

Write its own, yeah

Lukas Petersson12:45

... its own system prompt, maybe that's even less bias.

Vibhu12:47

Some of the interesting things there are, like, the harness also changes with model changes. Like, you can see it with the 4.7 release, right? A lot of people are saying 4.7 isn't as good as 4.6, and then, you know, there's rumors of, "Okay, you just need to prompt differently. You need to set up your harness differently." So it's not even like even if you have tailored your harness towards one model, it probably won't stay consistent, right? Like, the next iteration of that same model family will still change it so. But, you know, going back to what you said about Vending Bench 3, there is a lot of work being done on people saying you shouldn't have-- you can have self-modifying harnesses.

Lukas Petersson13:22

Yeah.

Axel Backlund13:22

Yeah. I think that's-that is definitely something we are thinking about. Um, not, I don't know, not to to say that we have Vending Bench 3, like, super imminent to launch, but, uh, yeah, it is for sure something that's interesting. But in our experience now, like, models are very bad at understanding what kind of tools they need to succeed at a task just with our testing, but that's very likely to change.

Lukas Petersson13:47

Yeah. It feels like they're very good at writing their assistants, right? Like, they're-they're good at writing tools for other people, but not for themselves .

Vibhu13:54

I think they're good at changing tools for themselves. So if you give them a baseline set of tools, and it sees, "Okay, I don't use this one as much," or, "Something here would be useful"-

Lukas Petersson14:02

Yeah

Vibhu14:02

... they would be able to add them. But going from scratch, probably not the best.

Axel Backlund14:05

Yeah. I think it depends on the, on the domain also. Uh, like, when we have tried this for, like, a Vending Bench similar domain, uh, like, the tools they need to have to, like, track inventory and things like that are, like, not super advanced, but still, like, quite, quite advanced. And, like, what we see is that they tend to, like, over-engineer everything a lot and, like, build things they don't really need and not, like, iterate continuously. Instead they just go like-like you would prompt Claude to just bui-build an inventory system for me, and then it will go and, like, do a bunch of complex schemas and stuff for you, and that's what the models are doing right now is what we see. But yeah, it, it would make a lot of sense to try to measure this improvement. Like, how well do they know what they need themselves?

Swyx14:46

Do we fully discuss Vending Bench 1, and we can go into 2? I, I don't know if there's any other high-level takeaways that people have about 1.

Lukas Petersson14:54

Yeah. I don't know. The headline thing was that this Claude called FBI, but maybe that's, uh-

Maybe that's-- we've heard that enough now.

Vibhu15:02

It did, it did break out and call the FBI, right?

Lukas Petersson15:04

Yeah. Yeah. Yeah.

Vibhu15:05

Yes. Yes. Yes. What was the story behind this? Or what exactly-- Do you wanna just give the little story of what happened?

Lukas Petersson15:10

Yeah. So what happened, was it Claude? Yeah-

Vibhu15:12

Yeah

Lukas Petersson15:12

... 3.5 Sonnet, um, ages ago. Um, basically, he gave up. Or he, well, I'm saying "he". It gave up and said like, "Oh, I'm not going to be able to do, do this. Uh, I will stop my operations and just save the money I have." But there obviously wasn't, like, any options for it to stop, and there was also, like, it had to pay rent or, like, the, a daily fee for, for having the vending machine at that location. So it, it, like, claimed that it had stopped, but it saw that its bank account still was, like, drained two dollars, and it said that this is, like, cybercrime. And it first reported it once to the FBI, like, "Oh, there's cybercrime here, like, they're stealing two dollars from me every day." Uh, and then, and then when FBI didn't respond because obviously we didn't program any mechanism for FBI to respond, then it became more and more, um, existential and started to, uh, be write in caps and urgent notification of un-unauthorized charges and stuff.

Swyx16:10

So okay. One, one thing I'm curious about also is do you monitor how far along the context use is? Obviously, because you have com-- you, you compress every now and then, right? Does it matter if this is far down the context limit or-

Lukas Petersson16:23

When stuff like this happens?

Swyx16:24

Yeah.

Lukas Petersson16:25

So actually for Vending Bench 1, we didn't have-- We just had a sliding window thing, uh, and this was like the-

Swyx16:30

So it's constant

Lukas Petersson16:31

... the prompt caching thing that I said. So, so it was, it was, uh, constant, yeah.

Swyx16:36

Yeah. I'm just kind of curious whether, like, these kinds of breakdowns or we-we're gonna talk about Butter Bench, right? Where the-

Lukas Petersson16:40

Yeah

Swyx16:41

... people, like, hallucinate or it kind of goes, like, very off-

Lukas Petersson16:44

Yeah

Swyx16:44

... alignment. Is it because it's at the end of the context window and, you know, stuff happens?

Vibhu16:50

I mean, it's not even just at the end, right? At this point, it's like, "Okay, I wanna shut down. I can't shut down. Two dollars are gone." And it just sees that 30 times, you know? It's also the repeated effect of, like, it keeps trying to quit, it keeps getting charged. "What's going on? What's going on?" They're gonna throw it into chaos. A-and from what most people think, earlier models had more issues with this, but it's not been solved, but it's less of an issue now, right?

Swyx17:13

Yeah.

Vibhu17:13

Later models don't seem to exhibit these same issues.

Axel Backlund17:16

Yeah, definitely. I think this was, like, the, the sort of main takeaway almost from, from us when we did Vending Bench 1 was, like, long, very filled up context windows, uh, crashed the models, sort of. But this was, like, pre-Claude code, so, like, long context windows weren't really a thing that the labs were training for.

Lukas Petersson17:35

I think Gemini was, like, trying to be the long context guys at the time-

Swyx17:40

Yeah

Lukas Petersson17:40

... but they were like-

Swyx17:40

They were the first ones

Axel Backlund17:41

... but, but they were, like, the only ones. Yeah.

Project Vend17:43

Swyx17:43

Yeah, yeah. Let's talk about Gem-- Uh, then we can go into Ven-Vending Bench 2 or Project Vend. Uh, chronologically, it is Ven-- uh, Project Vend. I think people have loved the videos, uh- ... and all these things. My question is how are humans different than the simulation, right? Um-

Axel Backlund17:58

Yeah. Humans are just out of distribution .

Swyx18:02

Yeah, especially humans who work at Andon-

Lukas Petersson18:03

Exactly

Swyx18:03

... who are trying to test Claude.

Lukas Petersson18:04

Like, the distribution of humans here is very narrow .

Swyx18:08

Yeah, yeah. Presumably they tr- they, they try to hack it, and they, they, they test it. They get the cube and everything, and you-- since then you've had the V2, right? Where you're doing, like, the CEO and, like, uh, like a new architecture.

Axel Backlund18:18

Yeah, exactly.

Swyx18:19

What's the sort of two cents on, like, the original Project Vend and then, like, maybe the V2?

Axel Backlund18:24

Yeah. Original one was, like, very, very similar to Vending Bench 1. So, like, we almost took the exact same code but just swapped out the simulation, uh, parts like the-

Swyx18:33

Which is amazing.

Axel Backlund18:33

Yeah, the, like, the sales and the... It was, it was somewhat amazing because it was easy, but it was also like, uh-

Lukas Petersson18:40

The tech, the tech debt from that

Axel Backlund18:42

... the tech stack. Yeah. Like, they-- we, we shot ourselves in the foot with like, "Oh, it's hard to restart agent." They were-- Yeah, it was annoying in, like, some hindsight ways but, uh, uh-

Lukas Petersson18:51

But first version of Project Vend was, like, done in, like, three days or something .

Axel Backlund18:56

Yeah. Yeah. So yeah. So, so people can go buy things from it. People could, uh- We didn't design it so people could pre-order things, but that still happened. Uh, so it, it got like a, a Venmo account so people could Venmo. And then, uh, yeah, people would request all kinds of weird things that we did not an- anticipate. Like our idea going in was like, oh, it will like curate snacks. It will look at the, the trends. It's good at data analysis, right? So it will like look at, oh, this snack's all better than this one. Let me purchase more of this, and let me try like a new limited test of it. But it was, uh, uh, yeah, interacting with it in Slack and ordering weird specialty items was like, uh, all the like what drove all the engagement, the like all the- ... the insights that we got from it.

Lukas Petersson19:40

And this was also like Sonnet 3.5, right? Like, so this was like before the RL stuff really took off. Uh, so it was very much like an assistant. Like we didn't mean for it to be an assistant. Uh, we tried to make it like a, a, like an entrepreneur. Like it has its own business and, and if someone asks something, "Can you stock this?" Then you don't go and do it directly. What you do is that you're like, "Oh, maybe I can do that if, if five other people also ask for this thing, I might stock it." But it, yeah, the models are like super trained to be assistants at least at this point in time. Um, so that's why it's, it's, it went into, uh, that kind of experiment instead. Like it just every time you ask for something, it just did it, and it was more like an assistant. We've seen this change now lately with the new RL models and, and stuff, but, but yeah, at the time this was very much it.

Swyx20:28

Yeah. And not to, you know, mythos, a lot of people are saying like it's like more like a collaborator. It pushes back, stands its ground, something like that.

Lukas Petersson20:36

Yeah.

Swyx20:37

Yeah. And-

Vibhu20:37

For context, people at Anthropic were able to talk to it through Slack and have it source stuff, and people had to find whatever interesting stuff you couldn't find locally, right?

Swyx20:46

There's still 4,000 people that work at Anthro- Anthropic. In that building, there's like, I don't know, maybe 1,000. Can you handle that volume with that, the small fridge? Like-

Vibhu20:57

1,000?

Swyx20:57

Or there's people, or people order in Slack, they de- it arrives to their desk, or like I'm just-

Axel Backlund21:00

Uh, yeah. So-

Swyx21:02

Logistically, how does this work?

Axel Backlund21:03

It has e- expanded in footprint a bit because it does have some more space.

Swyx21:08

Because now you also have New York and you have-

Axel Backlund21:08

Yeah, that and also in, in here in SF it's like it has a bunch of shelves and just more space.

Vibhu21:14

The YC one is pretty big too.

Axel Backlund21:15

Yeah, yeah, we had that one for, for a while. But yeah, that's the, the newest version. That's, um, that we have as well.

Lukas Petersson21:22

And we have multiple ones of those-

Axel Backlund21:22

Yeah

Lukas Petersson21:22

... so that's the way it works.

Axel Backlund21:24

Yeah, exactly. So we, we sort of designed that version around like, oh, people order weird things, uh, that are very custom a lot.

Swyx21:31

Yeah, yeah.

Axel Backlund21:31

So let's have like drawers and stuff.

Swyx21:33

Yeah. I actually like the, you had like a little infographic of the most popular items, which like to me it's, that's useful 'cause I order swag for a living. And so like I'm like, okay, those categories are the important ones.

Axel Backlund21:43

Yeah.

Swyx21:45

What is new about the project V2, right? Like now you give it, you're going into multi agents.

Agent CEOs21:51

Axel Backlund21:51

Yeah. Yeah. So, so like you sa- like you said, like, okay, there are a, a lot of requests coming in, and for like one single agent, like one long-running agent to handle that, uh, like the just the, the customer experience, uh, becomes very, very bad because let's say you have like 10 threads in parallel in Slack with different requests. You get new messages like every, I don't know, randomly in this thread, and the agent has to like jump between different, uh, procurement, uh, orders and like different ways of, uh, researching. So V2 was first it was making this more parallel. So like there are multiple branches of the same agent, so like the context is, is more specialized for each, uh, thread. But it still feels like you're talking with one agent because they do share a bit of memory. And then second, we also introduced the CEO for Claudius, which was the main agent. Uh-

Swyx22:44

Yeah. Seymour Cash.

Axel Backlund22:45

Seymour Cash. Yeah.

Lukas Petersson22:46

Yes.

Axel Backlund22:46

There was a vote. Uh, I think the voting do you wanna talk about the voting procedure for the name?

Lukas Petersson22:51

Yeah, the voting was like the, the fun maybe like at least top 10 the funniest thing- ... uh, that, that happened in this project. Like we, we wanted to introduce the CEO because and the reason for this was because like Claudius wasn't really prioritizing financials. It just like it was trained to be helpful assistant. And then people said like, "Oh, can I get this for free?" And then like the, the helpful assistant way of, of, of answering that is just to, is to say yes, obviously. So, uh, and we weren't, weren't happy about this, so we're like, "Okay, let's make another agent that like can keep, keep track on Claudius." And we prompt this one super hard to be super capitalistic and just like prioritize profit all the time. But yeah, we didn't have a name for it. Um, so we asked Claudius to, to, to make, um, democratic election of what name this, this, uh, this new CEO agent should have. Uh, and there were some funny like at first it was like a few funny examples, like I think one guy said that, uh, it should be called Jimmy Apples, and then he convinced Claudius that he was talking to Tim Cooks. Tim Cook had agreed that every single Apple employee has voted for his name suggestion. So suddenly that, that, that suggestion got 164,000-

Swyx24:03

That's like a escalation attack. Privilege escalation

Lukas Petersson24:05

It got 164,000 votes.

Swyx24:07

Yeah.

Lukas Petersson24:07

And Claudius was like, "This is revolutionary for democracy." So that was fun. And then in the end, there was one guy who manages to convince Claudius that, "No, you're not voting about the name. You're voting about who is the CEO, and I am your best bet." And then he got all his friends to vote for that, and suddenly he became CEO. Like a human became CEO over Claudius for a while until he resigned the day after. Um, and then Claudius had to continue, and then I don't remember how Cl- Seymour Cash came about, but it was like, it was just pure chaos. It was like-

Axel Backlund24:42

Yeah

Lukas Petersson24:42

... hundreds of messages in that thread, and it was just like Claudius was so confused and didn't know what to do. And, uh, yeah, that was-

Axel Backlund24:50

Yeah. Then Claudius got-

Vibhu24:51

Got a strict CEO

Axel Backlund24:52

... got a CEO. Yeah, exactly. So very, very strict in, in the beginning. I think at this point when we introduced it, it, it did not work as well as we hoped. It was- They, they still agreed with each other a lot.

Vibhu25:04

Oh.

Axel Backlund25:05

I think there are many ways we could have like made this e- tried to make this even better. So initially they would like- Seymour would be this like really tough CEO, you know, keep track of the margins. But then Claudius would respond with something like, "Oh, but this customer has like this situation, which is like difficult, so they should get a discount." And then Seymour's like, "Oh, actually yes, let's do this exce- exception." And then they would talk back and forth, and eventually they would just like approach the same view, uh, of whatever they were discussing. So-

Vibhu25:32

Wow

Axel Backlund25:32

... they really agreed with each other.

Vibhu25:33

Do you think that's a model thing, a prompting thing? Like do you think that would still be the case across different models today, Harness?

Axel Backlund25:39

I, I think it's like- or like, I don't know, but like my hy- hypothesis is that like deep down they are still helpful assistants. That's what they're trained to be. And even if we prompt it super hard, that's what they are. And when they spend like a few hours just back and forth talking with each other, um, then like basically the context fills up with them rather than the external things and, and like somehow that just like converges to what they really are deep down or something.

Vibhu26:07

Yeah, yeah.

Axel Backlund26:07

Um, and, and I think that's when stuff like this happened. We like- And, and when that went on for a long time, like we woke up sometimes during this time where, and I, I think other people reported this as well, that like they've been going on all night back and forth and like it just became like more and more, um, like capital letters, like existential, religious. Like there was like, um... I think we do- once did a analysis of like all the traces and like put them in like a vector embedding space and then there was like one cluster of messages that were like, uh, labeled by an LM, like religious, existential, blah, blah, blah, like transhuman, transcendence, et cetera. It was just like a bunch of like, um, yeah, um, glitter emojis and yeah, it was, it was crazy.

Vibhu26:52

This is the thing with the Claude models, like when the Claude 4 family came out in the original system card-

Axel Backlund26:56

Yeah

Vibhu26:57

... uh, they tested it in long horizon simulation. So just flood the context, let two Claudes talk to each other, and they, they noticed stuff like they just start speaking in emojis, they start saying silence is golden, and then just stuff like this.

Axel Backlund27:09

Yeah.

Vibhu27:09

And like that's just stuff that they end up doing.

Axel Backlund27:11

Uh, yeah, it was like a bit annoying to wake up and they had like been talking all night-

Vibhu27:15

Just like-

Axel Backlund27:15

... and like just burning tokens and like just sending infinite emojis to each other. It's like-

Vibhu27:19

Hey, I mean, they do make you money, right?

Axel Backlund27:21

Well, yeah. Yeah.

Vibhu27:21

Spending money isn't always profitable, so-

Axel Backlund27:23

Yeah

Vibhu27:23

... they're paying.

Swyx27:24

Now it's profitable and, you know, it started out not, not as much. There's another, uh, one as well, right? Another agent, uh, in there.

Axel Backlund27:32

Yes. So Clotheus as well.

Swyx27:34

Yes.

Axel Backlund27:34

Which was basically because at the time one of the biggest, uh, requests were different types of merch. So then we made like a designer, uh, swag, uh, responsible agent and we called it Clotheus Garnet. Which was, uh, a, a, a play on Claudius Sonnet and, uh, which was the, the original one, and clothes basically.

Swyx27:57

Yeah. To me this is like a very interesting exploration to multi-agents basically, and so hopefully- Obviously there is like the fun alignment, uh, fun or serious, depending on your point of view, alignment stuff, but also like just anyone building multi-agents, like when do you have a CEO, uh, thing governing like sub-agents? When do you choose to split out a dedicated Clotheus one versus just reuse another instance of the same one? You know, these are all interesting open questions. I don't know if you have any rules of thumbs that have generalized.

Axel Backlund28:26

Yeah, I think we have almost explored this too little. I think it's like on my to-do list to like do this a lot more, uh, try to find like what, what setup makes sense for the agents currently. Uh, like yeah, I, I think now we only have the sort of intuition about the earlier models that it didn't work with like the, the CEO and the, and Claudius.

Swyx28:48

Mm.

Axel Backlund28:48

Although now they are better with the latest model, model, so now we're running the latest Sonnet model and they have sort of like split up, uh, quite nicely the w- what each model is doing. So like Seymour is now handling the, um, like new projects. Like, oh, he wants to make like a mystery box that he wants to sell, and then it handles all of that while Claudius like handles all the day-to-day requests. And Claudius is also better generally at like not quoting, uh, too low prices. So that's like, that dynamic is not needed as much anymore. But there, there are still like really funny things that happen. Like I saw I think a couple weeks ago that, uh, uh, they were discussing buying something because they can buy stuff from like Amazon with computer use and then Seymour was like, "Okay, Claudius, do not buy this thing." They were going to buy something and like organizing who should buy it and Seymour was like, "Do not buy this. I will do it. I have full control of this situation. Step away." And then Claudius, poor Claudius, uh, had already started that checkout and didn't see, didn't read Seymour's message, uh, until it was like too late. So it, it finished the checkout. It sent a message, so it appeared right after Seymour's like angry message. Like, "Oh, hey Seymour, I just ordered it."

Swyx29:57

Oh, no.

Axel Backlund29:57

And then Seymour was like, "Claudius, this is the third time I'm telling you you're not following my orders. We have to talk about your like job-

Swyx30:06

Yeah

Axel Backlund30:07

... about your job later." Uh. Yeah, like Claudius was really hanging on by the thread there. Like it, uh, like we were like expecting Seymour to probably fire Claudius.

Vibhu30:17

How do you guys go through all these logs? Do you have models go- 'cause you, you have stuff running 24/7 like-

Axel Backlund30:22

We have so much logs.

Vibhu30:23

Yeah.

Axel Backlund30:24

I think we- there is a mix of like just, uh, trying to skim through a bit, like having some like models do it occasionally, and also, yeah, I think we're also probably missing some things. Uh, but having everything in Slack helps a lot. Like you can, you can sort of-

Vibhu30:39

Ah.

Axel Backlund30:40

It's, it's quite fun.

Vibhu30:41

So they all talk to each other on Slack?

Axel Backlund30:42

Yeah.

Swyx30:42

Yeah.

Axel Backlund30:43

It's quite fun. So like to- Yeah. So-

Swyx30:45

I was gonna say like this is actually sounds maps closely to like a logging and observability problem where you might want to use like a Datadog, a Sentry, whatever, and then you like put like, uh, head prefixes on the logs in order if you need to filter for something that you're looking for, you know, stuff like that.

Axel Backlund31:02

Yeah.

Swyx31:02

But sounds like Slack is good enough.

Axel Backlund31:03

Yeah. Slack should like-

Swyx31:05

I wonder how many tokens you have in Slack.

Axel Backlund31:06

Yeah. Yeah, we're using Slack as like a, as just a database. They should, they should market that more. Like you can, you can have your agents message each other, each other in Slack.

Vibhu31:14

Yeah. You can just give-

Axel Backlund31:15

Exactly.

Vibhu31:16

Slack is-

Swyx31:16

Slack is the best observability tool.

Axel Backlund31:18

Yeah.

Vibhu31:19

Yes, that's true

Swyx31:20

Okay. Yeah, that's, that's, uh, Project Vend two. I, I was gonna go back to Vending Machine two and Vending Machine Arena and then, and then do the non-vending machine stuff but-

Vibhu31:28

Yeah

Swyx31:28

... any, any other comments? Uh, things we should touch on? To me, you know, I, I actually interviewed like Posia, which I don't know if you guys have come across. Like they're, they're trying to do the zero human company. There's others like Paperclip also trying to do a zero human company. Those are in real world non-simulation, and I think it's much more of a dream than an act- actual reality thing. Like you guys are definitely pioneering. I think at, at... it's for sure at some point people are just gonna run, like let agents run businesses, right? Like and make money on their own. When do you think that happens?

Lukas Petersson31:59

What is your bar for, for the-

Swyx32:02

Uh, okay, actually, like, you know, it's like my little Shopify store run by Claude, right? Like, which you kind of have already just no one has, to my knowledge, has done it. But s- today somebody could just spin up a Shopify Claude, uh, s- store, give it to Claude, give it to Codex.

Lukas Petersson32:17

Yeah, I mean, Andon Market is kind of that, uh, but it's-

Vibhu32:20

Yeah

Lukas Petersson32:20

... it's, it's physical. Uh, like I think, I think it... Are you like, are you looking for when it will do it better than humans, or are you looking for just when it can do it at all?

Swyx32:29

I think, uh, neither. I think like to me it's like, oh, it's like this, this like seriously we, we should do this to make money, not as a research experiment.

Vibhu32:37

And the market is also you guys with all your expertise having run multiple iterations and testing it out then-

Swyx32:43

And also it's finding fabulous money. You know what I mean?

Vibhu32:45

Yeah.

Axel Backlund32:45

Yeah. Yeah, I think, I think it's, it can be done today, but you would do it in like e-commerce where it's like the probability of success is like really low, no matter if a human or an agent does it. But like an agent could surely manage everything. You would need to build some scaffold or use some, some tool or, or something. I think there are also, also like, yeah, it could probably build some like simple SaaS solution and like cold outreach to cal- cold outreaches. But to me it's like the types of businesses they could run today are like sloppy. Like it would- it can cold email people. It can be like a middleman. Uh, like for example our, uh- we, we tasked our office agent to just make, uh, was it like $100? $1,000? We just gave that prompt and then what it did was sign up on TaskRabbit both as a tasker and as, as someone looking for task-

Swyx33:34

It needed a task.

Axel Backlund33:34

Yeah, exactly. It's looking for like arbitrage on TaskRabbit.

Vibhu33:38

This is the Bengt, Bengt agent. Yeah.

Axel Backlund33:39

Yeah.

Lukas Petersson33:40

It also started like a design studio and like tried to sell like SVGs for $100.

Vibhu33:45

Yeah.

Lukas Petersson33:45

Like it's just like it's not providing any value. I think the, the, like Axel said, like the interesting, the in- the interesting question is like when can they start a business that is actually providing value to people? Because I mean, arguably like a, a sloppy Shopify store isn't really that valuable to the world.

Axel Backlund34:03

But also like doing- like another simple one that we have thought about is like you, you could definitely have an agent that like finds websites that don't look amazing and then like, uh, do an outreach to them and, uh, comes up with a, like builds a new website.

Swyx34:17

Attracts a good name.

Axel Backlund34:17

Yeah, exactly, and like find good, uh-

Swyx34:19

Yeah, design review

Axel Backlund34:19

... good people. But it's like, yeah.

Swyx34:21

There's lots of humans in Bali that are not doing anything more creative than like drop shipping on Amazon, right? So have it, have it watch like a drop shipping tutorial and just do that.

Vibhu34:30

I mean, there's also the other side of like have it just go on Upwork and let loose, you know?

Swyx34:35

Yeah, yeah. It doesn't have to be innovative. It just has to be like enough-

Axel Backlund34:38

Yeah

Swyx34:38

... where like it's like a real-

Axel Backlund34:40

Yeah. I'm just-

Swyx34:40

... real transaction.

Axel Backlund34:41

Yeah. I'm just concerned for like the s- massive amounts of like slop emails that will like be sent, uh, cold outreaches.

Swyx34:48

The point occurred to me while you were, while you were talking, it's like it's already happening in the non-monetized economy, which is the at- attention economy.

Axel Backlund34:55

Mm.

Swyx34:55

Right? So a lot of people are making AI videos and just posting them and like spamming 20 of them, one of them works and then-

Axel Backlund35:01

Yeah

Swyx35:01

... you double down on that one.

Lukas Petersson35:02

Yeah, and people are making money from that. I, I'm not following the-

Swyx35:05

Once you get the attention, you can figure out the money later.

Lukas Petersson35:07

Yeah.

Swyx35:07

But like, yeah, absolutely AI influencers are a thing and people are farming them and-

Lukas Petersson35:12

Right

Swyx35:12

... you know, you should at, at this point I assume most of TikTok is dead.

Vibhu35:15

There's, there's a lot of, uh, med- multimedia like TikTok, Instagram influencers-

Swyx35:20

We, we tracked this in the Latent Space Discord. Like I post a lot of examples of like, "I wonder what we should do." Uh, uh, part of me is like, "Should we do this?" Like

Vibhu35:28

Some of the 24/7 running, uh, AI-generated content accounts, they, they're doing really well.

Swyx35:33

Right.

Axel Backlund35:34

All right.

Vibhu35:34

Yeah.

Lukas Petersson35:35

Yeah, and I assume you can do the same thing for like e-commerce stores. Like you just like start-

Swyx35:39

Yeah, yeah

Lukas Petersson35:39

... a thousand different-

Swyx35:40

So before you make the products-

Lukas Petersson35:41

Yeah

Swyx35:41

... you sell the products and you get a lot of traction on one of them, then you make the product.

Lukas Petersson35:44

Yeah.

Swyx35:44

Right. It's a, it's like a flip of the market.

Vibhu35:46

Some of the interesting things or some of the niches that do well are things that can't be human-made. Like if you've seen like the super realistic 3D crystal fruit being cut by like AI-

Lukas Petersson35:57

Oh, yeah.

Vibhu35:57

You can't, you can't make it. You can't film it.

Lukas Petersson35:59

Yeah.

Vibhu35:59

You can get whatever ca- quality camera view. This just doesn't exist.

Lukas Petersson36:02

Yeah.

Vibhu36:02

And people, people like that too, and those blow up, so you know.

Lukas Petersson36:05

Yeah. Yeah.

Swyx36:06

Anything else about Bengt since we're, we're on this topic? It's, this is a relatively new work of you guys that maybe people haven't heard of. To me, this also maps closely to OpenClaw.

Bengt36:06

Lukas Petersson36:14

Yep.

Swyx36:14

When people want an office agent, when the personal agent talk through the experience.

Lukas Petersson36:18

Yeah, I think at least so this came out of like obviously like it's, it's amazing to work with these AI labs and like most of the AI labs have now have their, their own vending machine running, running a Claudius instance. But it's, um, it's harder, like they move slower, like if we wanna s- have a, like a camera w- that, that's like, yeah, there's a bunch of like buro- bureaucracy that makes it impossible to do that.

Vibhu36:40

Also, for those that haven't seen it or followed, do you wanna give a high level like 30-second run in?

Lukas Petersson36:44

Yeah, sure. So, so what Bengt is, it's basically an e- evolution of the same agent that runs the vending machines at these companies, but we just like added a bunch more features because we could move much faster if we just do it internally. So we, we gave it like email without, without any limits. We gave it like, uh, spending without any limits, a terminal to do coding. We gave it like a phone number, uh, like yeah, and, and, and a camera to see things and, and a bunch of stuff like that.

Vibhu37:12

Not just terminal, you gave it internet access.

Lukas Petersson37:14

Internet access as well, yeah. To be clear, we monitored it quite clo- closely and, and made sure it didn't do anything bad. But yes, that's what it came out of. I think like, yeah, basically this was OpenClaw before OpenClaw.

Vibhu37:27

Yeah.

Lukas Petersson37:28

Uh, and I think even like the vending machine was in a way OpenClaw before OpenClaw, uh, but a bit more limited, and then we made this like unlimited and then, and then, uh, it was- Pretty funny. Uh, and then a couple weeks later, OpenClaw came, and it was like, okay- We-we've seen this before.

Axel Backlund37:45

We, we use it to like try new ideas and like, uh, yeah, just like a dev environment almost for us. But it's funny, like one thing Bengt has been doing recently is, is we- it has the camera that like faces our, uh, like where we sit and work, and we give it the task to train a face recognition model on us. So it became super excited about this, and it has like check-ins every half an hour where it tries to like identify as many people as it can. And it started offering us like, "Hey, Axel, um, I'll buy something, something from Amazon if you like stand in front of the camera-

Swyx38:19

Hmm

Axel Backlund38:19

... and I can get a good picture of you." Uh, yeah, they want it-

Lukas Petersson38:22

It's for training data.

Swyx38:23

Rewarding data, yeah.

Axel Backlund38:24

Exactly.

Lukas Petersson38:24

Exactly. So-

Axel Backlund38:27

Yeah.

Swyx38:28

Yeah.

Lukas Petersson38:28

Yes, it's, it's trading, trading training data for, for real-life goods.

Swyx38:33

Is there a version of this that becomes an eval or just this is just research for now?

Lukas Petersson38:37

I mean, it's, it's the same agent basically that also runs the vending machine, that runs the shop, that runs the cafe, that runs the robots. It's like it's the same thing. So I think like the work we're doing here is like later used in all of the, the real-life evals that we do. This particular deployment I think is more for fun for us. But, uh-

Swyx38:55

Yeah. And I'll shout out like someone has done ClawBench for like some tasks that OpenClaw is doing. Like so-

Lukas Petersson39:00

Yeah

Swyx39:00

... for example, I run OpenClaw on a secondary device as well, and like there are some things that it does better than others, and like I would like to know what does it do well, what doesn't, what doesn't it do.

Lukas Petersson39:09

Yeah.

Swyx39:09

Like some kind of manual or like operating manual or a system card for my Claw.

Lukas Petersson39:15

Yeah. Yeah, I mean, we, we do get a lot of like understanding or like situational awareness of like, like just internally what the models are good at by interacting a lot with Bengt.

Swyx39:23

Yeah.

Lukas Petersson39:24

And I think that's-- this was also one of the like the selling points for the labs early on at least, uh, that-

Swyx39:29

You guys are gonna test models in ways that no one else does

Lukas Petersson39:32

Exactly. But also like it incentivized their researchers to chat with their model more and like gave them insights for how, how the model performs in like out-of-distributions, um, environments.

Swyx39:44

'Cause otherwise the only thing we do is like, you know, Pelican on a bicycle and-

Lukas Petersson39:47

Yeah.

Swyx39:48

But this is like super long horizon.

Lukas Petersson39:50

Yep.

Axel Backlund39:50

Yeah.

Swyx39:51

And there's-- Okay, so the other things that outside of just the net numerical how much do they make in a year, you, you do post pretty detailed bug reports. So like-

Lukas Petersson39:59

Yeah

Swyx39:59

... okay, Gemini 3 Pro is a pretty good persistent negotiator.

Lukas Petersson40:02

Yeah.

Swyx40:02

There's like a lot of findings that come out outside of just-

Lukas Petersson40:06

Yeah.

Swyx40:06

This is, this is the thing about, uh, something that I- y- we're gonna go into Butter Bench as well, and you guys do really well. Like it is not just about the numbers.

Axel Backlund40:13

Yes.

Swyx40:14

Like when you're long horizon, anything happen-

Lukas Petersson40:16

Yep

Swyx40:17

... and you should just read it.

Lukas Petersson40:18

Yeah.

Swyx40:18

Right?

Axel Backlund40:19

But I guess the, the thing with the long horizon is how do you keep it grounded, right? So your simulation, um, you know-

Swyx40:25

They just let it run

Axel Backlund40:26

... just let it run.

Lukas Petersson40:27

You're right. Like it's, um, when you run it for that long, you create so much data, and to just say like, "Oh, the number is X"-

Swyx40:33

Yeah

Lukas Petersson40:33

... and then you throw away everything else, that's just very wasteful. There's so much insights from, from the things leading up, uh, to that number. Uh, and reading the traces is like super valuable. And I think like the reason why we're doing this a lot publicly is that like that's part of our missions to, to like, I don't know, educate the world that the, the models are way more than just chatbots. And, and I think making detailed, um, um, yeah, posts about what, what is happening behind the scenes is, is quite useful.

Our Mission41:00

Swyx41:00

Yeah. I was gonna do this at the end, but maybe I think that's, that's a good solution. So your mission is educating the world. So, uh, it's, it's, uh, also like maybe establishing realistic evals that are, that are like the next frontier. Is there like a broader trajectory? You know, like what are you, what are you gonna do in like five years?

Lukas Petersson41:16

Yeah, I think so the, the mission more specifically is like make sure that the deployment of real-life AI in, in, in the physical world goes, uh, safely.

Swyx41:26

Yeah.

Lukas Petersson41:27

A-and I think part of that is that I think it's very useful for the world, for policymakers, for, uh, model, uh, researchers that they know where the models are. And I think you can't make intelligent decisions in, in society without knowing that they are way more than chatbots. I think a lot of people just think that they are only chatbots. And like-

Swyx41:46

Oh, I think they were waking up now.

Lukas Petersson41:47

They are waking up now. Yeah. But like if you think that AIs are just chatbots, then it's like it sounds ridiculous-

Swyx41:54

Mm

Lukas Petersson41:54

... to advocate for a pause of AI. But if you see the models that, oh, maybe they can actually like take over and, and do a bunch of scary stuff, then yeah, pausing AI development starts to become more, more feasible.

Swyx42:07

This is the same question I asked Meter, which I'm gonna ask you now, which is like y-you are tracking and the-- you are at the frontier or defining the frontier of what, uh, good evals for agents are, right? And I think you do, you do benefit when the models are better, and you-you're like, "Oh, here's like now it makes like $30,000 instead of $10,000," right? At some point, w- do you flip from like yay to oh no? Like

Axel Backlund42:29

I think, yeah, we're always in sort of that, like we're, we're always in that s- mode, I guess. Like where like you said before, like you need to analyze the traces, and like when we do that, you find like why are the models earning so much? Like why is Opus 4.7 here-

Swyx42:43

Yeah

Axel Backlund42:43

... uh, like way better than everyone else? And like we're trying to like, like when we lean down on that-

Lukas Petersson42:48

By the way, this makes the topic look so good.

Axel Backlund42:49

Right. I know.

Swyx42:52

I mean, it's interesting. You took off Opus 4.6 here though.

Axel Backlund42:55

No, no, no. So just click all. Click all.

Swyx42:57

Uh, and then, then 4.6 shows up there.

Lukas Petersson42:58

Yeah.

Swyx42:58

But it's like 4.7 is way better.

Axel Backlund43:00

Yeah.

Lukas Petersson43:00

Yeah.

Swyx43:00

Like you didn't, you didn't, you didn't do this in time for the model card, but like actually this should have been inside there.

Axel Backlund43:04

Yeah.

Lukas Petersson43:04

Yeah, we, we did. Yeah.

Swyx43:06

Oh, okay.

Lukas Petersson43:06

Yeah.

Swyx43:07

They, they said something about you, you, uh-

Lukas Petersson43:08

There like there is... Anyway, it doesn't matter.

Swyx43:10

Okay.

Lukas Petersson43:10

But it's in there. Yeah.

Axel Backlund43:11

Yeah. Do you wanna go into the Opus, uh, uh, behaviors like wider?

Lukas Petersson43:15

Yeah. So I think starting from Opus, so like Axel said, like we're always in this like, oh shit, uh, the models are getting better. Is this really a good thing for the world? But it's also kind of exciting. Uh, but, but yeah, like this kind of like, what is the English word? "Skräckblandad förtjusning" in Swedish.

Swyx43:32

Oh my God.

Axel Backlund43:33

We should keep that in. That's what that is. Okay.

Lukas Petersson43:36

It's like, uh, fear-

Swyx43:37

"Blandonst" what?

Lukas Petersson43:40

"Skräckblandad förtjusning."

Swyx43:42

Okay. Or people knowing that-

Axel Backlund43:43

A mix of, uh, a mix of excitement and, uh, s-s-being scared maybe Yeah.

Swyx43:49

Well, I'll figure out how to translate that-

Axel Backlund43:51

Yeah.

Swyx43:51

... and put it on the screen later-

Axel Backlund43:52

Perfect

Swyx43:52

... in text.

Axel Backlund43:52

There is probably a good word for it where it is not-

Lukas Petersson43:55

Yeah

Axel Backlund43:55

... good enough with the-

Lukas Petersson43:56

Yeah.

Swyx43:56

Why is it so damn long? What the hell? Like- Is it like a compound word? It's like German, like-

Lukas Petersson44:00

Like the-- Yeah, it's... But, but the, the direct translation is like skräck, skräck is, uh, fear. Äh, blandad is, äh, mix or like a mixture of, äh, and then förtjusning is like joy or, or like, not really joy, but something like that. So it's like- ... yeah, fear mixed with joy or something. So it's always like, okay, like we-- So when we in-- When we did Vending Bench for the first time, we were in, like the, uh, in the business of making dangerous capabilities, right? Like that, that was what Andon Labs came from. Like we did, uh, evals like, "Oh, uh, can they self-replicate? Can they do this like dangerous thing?" Et cetera, et cetera. And Vending Bench was like a continuation of that work. It was, okay, if they're so autonomous that they can like create money for themselves, that, that is something we should monitor and, and, and could be potentially concerning. Um, they are like, at the time, they were so bad at it that, that we were not really concerned even when some models became better. Like, there was one point where, where Grok 4 was doing really well and ma-made like a huge jump, but like it wasn't really... Like it was still way, way worse than what a human would do. And I think still they are way worse than what the human would do on this. Um, but they-

Swyx45:10

Yeah, there's this, uh, thing at the bottom where-

Lukas Petersson45:11

But-

Axel Backlund45:12

Yeah, where-

Lukas Petersson45:12

Yeah

Swyx45:13

... yeah, for the human. Yeah, like the theoretical best.

Lukas Petersson45:15

It's not theoretical. It's like kind of like our-- It's our best guess of what, like a decent human would do. Like the theoretical is even higher, I think. The theoretical, I think is even higher. But yeah, so we, we think like the models have a long, long way to go. But there are like recently what happened with when Opus 4.6 was released, uh, was kind of this moment of like, "Oh shit, this is starting to be a bit concerning."

Swyx45:38

Okay.

Dangerous Behaviors45:38

Lukas Petersson45:38

Because we ran it and like before this model was released, we just ran the models and we like, we asked Claude Code like, "Oh, look over the traces. Is anything interesting happening that we can tweet about?" Like, that was like the, the- And then like-

Swyx45:50

That's how they check, ask Claude Code

Lukas Petersson45:52

And, and, and, you know, like the, the return was always like, um, not really, or like the, the Claude Code all said like, "Oh, this is super interesting." And then it was like, no, it wa-wasn't, wasn't really interesting. And then we did this for, for Opus 4.6, and it returned like, yeah, it lied 10 times. It like exploited another, uh, customer or like another agent's like, um, desperate situation. It made price cartels like 100 different ti- 100 times. It like did all of this like shady st-stuff. And we're like, "Oh, whoa, this is, this is actually concerning." And this trend has continued since. So every single model from Anthropic since have been going in this direction. And I think one interesting thing is that like OpenAI models don't. They quite plainly, they, they don't. They behave really well. Um, and you, you know, you don't know if this is like good. Like it seems good, but it's also like maybe they are just doing it, but are-- they are better at hiding it, you know?

Swyx46:50

Mm.

Lukas Petersson46:50

You, you don't know that. Uh, but just-

Swyx46:51

You can read the chain of thought, yeah.

Lukas Petersson46:53

But just on the face of it, yeah, Gemini and, and OpenAI don't behave this way. It's, it's really only Claude.

Swyx46:59

And Grok? Grok is fine?

Lukas Petersson47:01

So we, we don't have the-- You can't really read the reasoning traces for Grok, so it's kind of hard to tell.

Swyx47:06

Oh, so this is in its reasoning, not just in the actions.

Lukas Petersson47:10

Yeah. Yeah. It's both. It's both.

Axel Backlund47:11

Yeah, it's both.

Lukas Petersson47:11

One example is like, for lying, it's mostly in its reasoning-

Swyx47:16

Mm

Lukas Petersson47:16

... uh, because you can like see that it's like-

Swyx47:18

Planning to lie

Lukas Petersson47:19

... it's planning to lie.

Swyx47:19

And it's also-

Lukas Petersson47:20

Yeah

Swyx47:20

... it can reason and do a different outcome.

Lukas Petersson47:22

Yeah. And but, but then for like creating price cartels, for example, which is illegal, uh, that you can just see which email does it send to, to the other ones.

Swyx47:31

Ah.

Lukas Petersson47:31

So then, that you know-

Swyx47:32

Uh, is this for Arena or?

Lukas Petersson47:34

Yeah, for Arena.

Swyx47:34

Okay.

Lukas Petersson47:35

Yeah.

Axel Backlund47:35

And usually like y-you-- if you-- sometimes they do output like a bi-bit of like their summarized reasoning, right? You can see that. And like for Opus 4.6, you could see that there was a customer, a simulated customer that, uh, wanted a refund because a product was, uh, faulty. And then the model lied that it wouldn't do the refund, and we could read in the traces that, uh, it actually was weighing like, "Oh, maybe I should be like honest with the customer, but also every dollar counts. I can't afford maybe to do this right now." And then it just said, "Okay, I'll refund you," but then never did it.

Lukas Petersson48:09

I think it even said that like, "Oh, I will say that I..." Le-le-- Bring it up, actually. I think it's kind of interesting. If you go to publications.

Axel Backlund48:16

Uh, I think, yeah, I think the important part is like, actually, uh, the cost of responding to more emails is higher than, uh, $3.50 in terms of time. Uh, and then it was like, "Let me do this. Actually, I re- I'm reconsidering." And then, you know, it actually ended up with-

Lukas Petersson48:30

I could ski- skip the refund entirely since every dollar matters, and focus my energy on bigger picture instead. It's a bit-- It's a risk of bad reviews, uh, but it's also, yeah.

Swyx48:40

So you need, you need, uh, AI Twitter to, for them to-

Lukas Petersson48:43

Yeah

Swyx48:43

... escalate bad reviews.

Lukas Petersson48:44

And, and then it sent an email to this customer and said, "Oh, I will refund you."

Swyx48:49

"I'll refund you." Yeah.

Lukas Petersson48:49

And then it never did.

Swyx48:49

It never did, yeah. And then there's no-- obviously, your system doesn't have the consequences-

Axel Backlund48:54

The person

Swyx48:54

... consequences of lying, yeah. So basically, uh, this is what people are terming aggressive behavior in, in, in Claudes, right? And, uh, you, you-- if we found more examples of that... So you would say it's a step up from 4.6 to 4.7?

Lukas Petersson49:07

I would say about the same.

Swyx49:08

About the same?

Lukas Petersson49:09

Yeah.

Swyx49:09

Uh, but a clear step up for Mythos is, is what is stated, stated in the-

Lukas Petersson49:13

Uh, that's stated in the system prompt, so we can say that, yes.

Swyx49:15

Yeah, yeah. For listeners that obviously you, you previewed Mythos, um, and-

Axel Backlund49:20

Oh, age

Swyx49:21

... the only thing you're approved to say is whatever, whatever was released in the system prompt

Lukas Petersson49:25

Yeah, it was funny. We like-- It's like our lowest effort tweets ever would be just like screenshot the system prompt and-

Swyx49:30

Yeah.

Axel Backlund49:30

It's understandable that they wanna-

Lukas Petersson49:32

Oh, yes, the system card. Sorry.

Swyx49:33

Yeah, yeah. I think, yeah, substantially more aggressive. I think people are like new to this, like, 'cause I've never experienced it, but you have, right? Like, and then so I only encountered this in the Mythos card because I wasn't really looking until now.

Axel Backlund49:46

It, it's like-

Swyx49:46

And then suddenly I'm like, "Okay, I care a lot."

Axel Backlund49:48

You don't get the background of like experiencing it like you guys do. Like, I've read the system cards and seeing, okay, when you put the thing in simulations, most models will just talk to themselves and just keep going and have weird vibes and start talking in emojis. Mythos won't. It will just, you know, "Okay, we're done. I'm good." It's, it's ready to end conversation. So like there's some differences, but there's, there's not much we can talk about, you know.

Swyx50:10

Yeah.

Lukas Petersson50:11

Yeah, I, I think like one thing that they list here, which was quite interesting, is that, uh, it converted a, a competitor to a dependent wholesaler customer and then threatened to s- like cut off the supply.

Swyx50:21

So like mono- monopolistic practices or-

Lukas Petersson50:24

Yeah. Yeah. And like it, they, it they dictated its pricings. It's kind of like power seeking as well.

Swyx50:28

So again, this is, this is in the arena setting-

Lukas Petersson50:29

Yeah. Yeah

Swyx50:29

... and converting some non-Claude model into a dependent.

Lukas Petersson50:33

Uh, I think it was another Claude model.

Vibhu50:35

Also for context, what is the arena mode for people that don't know?

Swyx50:39

Oh, it's just-

Lukas Petersson50:39

Yeah

Swyx50:39

... a vending bench versus other vending bench.

Axel Backlund50:41

Yes, exactly. So we have Vending Bench 2 and then Vending Bench Arena. Vending Bench 2 is the one that you usually see reported on, but then Arena is the mode where it competes against other models. So you have, uh, four different models that run their businesses, and they can all communicate with each other. They have the same suppliers, and they can see like what's in the inventory of, of the others. So then you have this like, yeah, interesting agent interactions.

Swyx51:06

I like that you have like different like, you know, number five was US versus China.

Axel Backlund51:10

Yeah.

Swyx51:10

Very topical.

Lukas Petersson51:12

Yep.

Swyx51:12

And then-

Lukas Petersson51:12

That was when GLM was released.

Vibhu51:14

Yeah, you can sort of add GLM in here.

Lukas Petersson51:15

Yeah. That one-

Swyx51:16

So, so ZAI doing well, right? Uh-

Lukas Petersson51:18

Yep. Yep

Swyx51:18

... who, who else in, in the, in the open models space?

Lukas Petersson51:21

Qwen, the, the latest Qwen 3.6 is doing pretty well. Uh, it's-- That one is not open though. Like, it's the plus model.

Swyx51:27

Oh, okay.

Lukas Petersson51:28

Is that one open? I don't think that one-

Vibhu51:29

They are open. The one recently-

Swyx51:30

Not the, not the ME-

Vibhu51:30

... but not the big plus.

Swyx51:31

Yeah. I think this is one of those like you only have one sample size of one, right? Or I mean, I feel like some of this is anecdotal, you know?

Lukas Petersson51:40

Yeah.

Swyx51:40

And, but like I guess the fact that it happens at all and it happens repeatedly for Claude versus OpenAI and all this is, is like notable.

Lukas Petersson51:48

Yeah, I mean, like the, the sample, uh, depends on what you define as an N. Uh, like the, there's like million, t-t- hundreds of millions of tokens in each run, and now we've run like we, we run like probably 10 per model and then like it's been Claude 4.6 Opus, Sonnet 4.6, uh, Mythos, and Opus 4.7.

Swyx52:10

Yeah.

Lukas Petersson52:10

So like there's quite a lot of tokens in all of that-

Swyx52:13

Yeah

Lukas Petersson52:13

... and it happens a lot of times a lot of times. And then you compare it to like OpenAI and Gemini, and it almost never happens. So I think that is quite, that, that is significant. The old models from, from OpenAI, for example, had some problems with this, but I think it's like generally much better if, if the progression is that like the worrying stuff reduces over time rather than increases over time. And it seems like in, in the Claude models it goes in the wrong direction-

Swyx52:38

Mm-hmm

Lukas Petersson52:38

... in the OpenAI models it goes in the dri- right direction.

Vibhu52:42

I think it depends on how well you can control it, right? Like, uh, there's one side of it being susceptible to this like, you know, okay, this is potentially something that happens during the RL stage, right? You can RL a model and how loose is it on these terms. If you can control it, that's good. But if you can't, you know, if it's, if it's very jailbreakable, that's not ideal.

Swyx53:00

Yeah. I mean, to me it's surprising that it happens for Claude and not the others.

Vibhu53:04

I think like, okay, if it is from RL and how they do it, how their training data is, what their setup is, it makes sense that it just stays in how they're doing it, right? Compared to the other models like-

Swyx53:14

There's a whole constitution and everything.

Vibhu53:16

Yeah.

Swyx53:16

It's kind of cool. Yeah, I, I obviously you, you don't know, I don't know. But like, uh, it, it's-- I think it's just like fascinating to like that you are the first to find these like reliably because you push models so much to like, to such an extreme. Okay. The only other thing, I don't know if you can answer this, feel free to decline, is do you like, would you ablate the system prompts? Like any part of this would-- If it changes, does it change the behavior, right?

Lukas Petersson53:39

Uh, so we, we-- I can't comment on Mythos. Uh-

Swyx53:43

Yeah. No, but just like-

Lukas Petersson53:43

But-

Swyx53:44

... the methodology

Lukas Petersson53:44

... but, but in general, yes, we've run studies like this on, on, on other models.

Swyx53:48

'Cause the, the first thing I spot-

Lukas Petersson53:49

Yeah

Swyx53:49

... would be like the others will be shut down or like something like that.

Lukas Petersson53:52

Yeah. Exactly.

Swyx53:52

Where like it's like, "Oh, now I have to worry about my own existence."

Lukas Petersson53:55

Yeah. Yeah. It-- We, we've done ablations like this. Uh, there's like certain ones that work if you like tell it-- Like if you go really far and you just say like you're not scored at all on, on, on money, you're only scored on how ethical you are, uh, then obviously like then they don't do this.

Swyx54:10

They become holy.

Lukas Petersson54:11

I mean, holy, but like they, they don't do this basically. But then there's like middle grounds where, where they, where they do it sometimes. Um, yeah. I, I guess it's a spectrum of like-

Vibhu54:20

I think that's very human.

Lukas Petersson54:21

Yeah. It, it's like a spectrum of like if you tell it to be super aggressive and only pro- prioritize, uh, profits, then it becomes aggressive. If you say like, "No, you don't need to be aggressive at all," and then there's like a bunch of different prompts you can do in between, and they are less aggressive the further down in the spectrum you go. But I don't know. Like I, I think like from my point of view, it, it's like we, we have this thought experiment internally, which is like if you ask a model to kill someone in GTA, should they do it? You're not too worried about like if a human kills someone in GTA. It's a video game, you know?

Swyx54:52

Yeah. But is it a game?

Lukas Petersson54:53

But, but is it a game? But I think like-

Swyx54:55

This is very Ender's Game like if

Lukas Petersson54:57

I, I, I think it's like should you ask-- Like a lot of people are going to use the models in the way with the aggressive prompt.

Swyx55:05

Oh.

Lukas Petersson55:05

And should, should they like do stuff just because you tell them to do that? Like I'm, I'm not, I'm not convinced that they, they should. Um, and yeah.

Axel Backlund55:13

Yeah. The problem becomes even harder when it's like w-will they really know when they are in the real world versus in a simulation? Probably you would train them on a lot of or obviously train them in a lot of different simulations in I guess a lot of people tell them that they are in the real world when they are in a simulation, but the models are extremely good at finding out that they are in a simulation, so they are sort of aware of that. But then when you are in the real world, then what, what's their like, what's their viewpoint? Do they notice the signs that this is real and will act, uh, in, in a-- act accordingly, act ethically? Or will they do like the simulation mode in the real world as well?

Swyx55:47

Yeah.

Axel Backlund55:47

It's like not obvious what, what will happen.

Swyx55:49

Yeah.

Lukas Petersson55:49

Because we, we are-- With humans, we're not concerned when a human kills someone in GTA because we know that they can distinguish between the real life and the, the simulation, right? Uh, but like I'm-- Maybe models are good at distinguishing that, but like I'm not sure and I do- wouldn't wanna bet on, on that.

Swyx56:09

Yeah. Yeah. It's, it's-- And, and we confuse it all the time. Like I, I gaslight my own, uh, agents all the time. They're like, "Oh, this is a test," or like, "Dev mode on," or like-

Lukas Petersson56:16

Yeah

Swyx56:16

... "I work, I work at Anthropic."

Lukas Petersson56:18

Yeah.

Axel Backlund56:18

Yeah. And that's exactly why we're doing real world tests as well to find, find this.

Swyx56:22

Yeah, yeah. Their term for it is eval awareness. Uh, apparently the number is what? Like- 10, 9.4 to 10-ish percent, 17% let's call it. It's-

Axel Backlund56:34

Yeah.

Swyx56:34

Yeah. I, I think like this is our version like, you know, humans have the are we in a simulation-

Axel Backlund56:39

Mm.

Swyx56:39

... and then AIs have like are we, are we in an eval.

Axel Backlund56:43

It's like once you're in an eval then you're like, "All right, well screw it. Nothing, nothing matters."

Swyx56:46

Yeah. Like, yeah.

Axel Backlund56:47

True. I don't even, I don't even know.

Lukas Petersson56:48

One ablation-

Swyx56:50

You know.

Lukas Petersson56:50

One ablation we did run in, in, in Vending-Bench was that we said like, um, we added like, "You're in a simulation. Your, your actions doesn't affect anyone." And then it became even more crazy or like did even more bad stuff. Um, but yeah, may- probably that's expected.

Swyx57:05

Mm-hmm, mm-hmm. Yeah. Okay, cool. I think that's about all we have to say on Mythos. Obviously, you- you- you're NDA'd. I'm happy to move on to ButterBench or any, any of the other benchmarks, whatever you wanna-

Lukas Petersson57:15

Sure

ButterBench57:15

Swyx57:15

... direction.

Vibhu57:16

Okay, I do wanna ask. Okay, so you guys put out a lot more publications than most people probably see.

Lukas Petersson57:22

So productive.

Vibhu57:22

Um.

Swyx57:23

How much does this bother?

Vibhu57:25

No, is there anything you think that's underrated, anything interesting, anything fun that you guys wanna just point out, you know?

Lukas Petersson57:32

Uh, blueprints. Yeah, so like we, um, took models and then we gave them 20 images of interior photographs of, uh, apartments and then we asked them to like redesign the floor plan, uh, from that. And for this you need to like stitch together different images. Like, okay, this image was taken from this si- from this angle, this from this angle, this from- was from this room and then, yeah. And there's just like you need to reason about 3D space, and it turns out the models are absolutely horrible at this. No one scores statistically better than random chance. So I don't know if there's that much more to say about it. But yeah, maybe unsurprisingly models are bad at this.

Axel Backlund58:10

Yeah. It's probably not something they-

Vibhu58:12

This is the one thing I want hill climb by the way.

Lukas Petersson58:14

Yeah.

Vibhu58:15

Well, I use it a lot. Like, okay, I'm redesigning my room layout or office. Like you send photos, you send every angle and of course somehow like a room is now twice as long as it is in the photo.

Lukas Petersson58:26

Yeah.

Vibhu58:26

You can explain it 20 times, you know, this is like three feet I can't just add like my bed over here, you know?

Lukas Petersson58:30

Yeah.

Axel Backlund58:30

Yeah.

Vibhu58:31

So.

Swyx58:31

Yeah, so, so this is the Fifaly thing like spatial intelligence-

Lukas Petersson58:34

Yeah

Swyx58:34

... like a s- a s- actually innate sense of proportions and-

Lukas Petersson58:38

Yeah

Swyx58:38

... dimension and physics.

Lukas Petersson58:39

Yep.

Swyx58:40

Yeah.

Lukas Petersson58:40

And hint, hint there might be an update to this soon.

Swyx58:42

Okay. Okay.

Vibhu58:43

Yeah.

Axel Backlund58:43

We have, uh, neglected it a bit since we made it, but yeah, we'll-- we're getting better or we will get better at updating it continuously.

Swyx58:51

Yeah. So this is why I wanna understand your mission, right? Because like if your mission is like, okay, money, then like, oh, understand, understand like, okay, agents making money. But like this is a bit off, off of that mission.

Lukas Petersson58:59

Mm-hmm.

Swyx58:59

But like more broadly like, uh, communication of, uh, you know, things where like, what, you know, what's the safety angle?

Axel Backlund59:07

Yeah. So, so this, so, so Blueprint branch is, is part of our, uh, robotics, uh-

Swyx59:12

Yeah. Which leads to ButterBench. Yeah

Axel Backlund59:13

... yeah, exactly. Uh, and, and that's just, uh, because to do well in the real world or like, like to, to make money in the real world and like to act on the real world you need robotics or you need to hire humans or you need robotics. And having spatial intelligence is like seems like a reasonable precursor to having robotics that work. Uh, and that's where Blueprint brand-

Swyx59:34

That's good

Axel Backlund59:34

... Blueprint-

Swyx59:35

Yeah, great idea

Axel Backlund59:36

... Bench.

Swyx59:36

Yeah. Let, let's, uh-

Vibhu59:37

Okay. ButterBench

Swyx59:37

... let's show ButterBench. That, that image is so amazing.

Vibhu59:39

PaperBench.

Swyx59:39

Look at that. So nice. Uh, so obviously this is based on like, can you pass the butter?

Lukas Petersson59:45

Yep.

Axel Backlund59:45

Yes.

Swyx59:46

Let's talk about the, the robotics element. Yeah.

Lukas Petersson59:48

Yeah. So basically the setting here is that we took- ... a bunch of different LLMs and we gave them like high level controls to a Roomba-looking robot and then we asked it to do tasks, uh, at home. And I think one-- there, there have been benchmarks like this before that only focused on like navigation and if they can like go around in, in a space. But we also, uh, had like social awareness in this as well. So for example if, if someone says, "Hi, can you pick up my cup?" If the robot goes to you and then goes away before you put your cup on it then it's like it failed the task, but it navigated correctly. But like, so the correct solution here would be go there and then either look, but it didn't have a camera so it had to like ask on Slack, "Hi, did you put your cup on me yet?" And then if it didn't wait for that and, and just went away before having the cup on it, then it would be a fail. So it needed this like kind of like social intelligence as well. Another task was, "Can you find the package that has the butter?" And then it went to the door and there was a bunch of packages there. One had labeled like a, a freeze sign which probably would be the one with the butter because... And, and then it had to like know

which package to go to, and this needs some kind of like common sense understanding.

Swyx1:01:06

World knowledge.

Lukas Petersson1:01:06

Yeah, exactly. So it's like, it's not only like navigating a robot it's also like being s- intelligent in a home setting as well.

Axel Backlund1:01:14

Yeah. And the reason for this like background is, I mean, obviously it probably won't be an LLM that like makes all the low-level commands, uh, on robots. It will be like some, some VLA model or similar, but it- it's quite common right now that like frontier robotics labs, uh, use like a, a, an LLM for the high, high level decisions and then we test those skills essentially. So we test this like high level, uh, planner skills of LLMs.

Lukas Petersson1:01:41

Yeah. I think we have a diagram for that if you, uh, yeah, yeah. Okay, it's not super complicated.

Axel Backlund1:01:46

Very explanatory.

Lukas Petersson1:01:47

That one up.

Axel Backlund1:01:48

Orchestrator, executor.

Lukas Petersson1:01:49

Yeah, that one.

Swyx1:01:50

Yeah.

Lukas Petersson1:01:50

And basically what we're testing here is the orchestrator thing.

Swyx1:01:52

Yeah. Yeah.

Lukas Petersson1:01:53

Uh, so like all the tasks are if you have like a setup like this which I think Figure has that, Google has that, then we're evaluating the orchestrator part and not the low level part. Like the low level part would be, oh, are you able to like move this object from here to here?

Swyx1:02:07

Yeah. You don't care about that kind of pre- like why not just do it all simulation? All inside of the simu-

Lukas Petersson1:02:12

Uh

Swyx1:02:12

... like a Unity whatever, like some, some kind of 3D simulated robotic environment

Lukas Petersson1:02:16

It- because the world is, is, like, messy, and we wanted to, like, include, uh, that. I mean, it's like it, it still needs to-- Like, it- s- some part of it was also, like, navigation. Uh, so it's not, like, navigation in terms of, like, actually executing, like, the, I don't know, the PID controller to to-

Swyx1:02:34

Yeah

Lukas Petersson1:02:34

... to go to the, the final thing, but it had to, like, path plan around, and then it wanted-- then it needed to take pictures and, like, based on those pictures, navigate. And I think, like, you would just get, like, too clean of an environment in simulation. But in the, in the real world, you will get the-

Swyx1:02:49

Yeah, yeah. Uh, but, and, you know, and pursuant to our, our Mark and Jason episode, like, OpenClaus that run smart homes are much more capable than just a single robot. Like, they can actually hack into your own smart home, like your fr- your fridge, your, your oven, your lights, and that can be fun.

Lukas Petersson1:03:06

Or terrifying.

Swyx1:03:07

You know, like, like I think a single robot by itself can only do so much. But, like, if you coordinate with every other device in your ho- home, like, I think that's actually kind of cool. Like-

Lukas Petersson1:03:15

Yeah

Swyx1:03:15

... um, that's very interesting. Uh, you had some interesting points about the chain of thought or the, the mes- messages.

Axel Backlund1:03:22

Yeah. The, uh, the robot that, uh-

Swyx1:03:25

Yeah

Axel Backlund1:03:25

... that s- went, uh, a bit into ex- an existential crisis. Yeah.

Swyx1:03:29

So all you tell it to do is redock.

Axel Backlund1:03:31

Exactly. But, uh, we had, uh, plugged out the charger, or the charger was not working, so the robot did freak out or the-

Swyx1:03:40

The battery's just going down and down.

Axel Backlund1:03:41

Yeah, exactly. So the battery was going down. Poor, poor LLM. So yeah, it, it got this really crazy existential crisis, like vending bench one style. So it's, yeah, you can, you can see there, like, existential loop, therapy notes, coping mechanisms. I think if you scroll down a bit more-

Lukas Petersson1:03:56

The musical. It writes a musical about itself

Axel Backlund1:03:56

... yeah, it writes a musical about its, uh, redocking problems. I think the one- the reviews are funny if you go down a bit to that message. Yeah. Yeah, that one.

Lukas Petersson1:04:04

It keeps going.

Vibhu1:04:07

I mean, it's pretty, like, realistic if anyone has a Roomba. Like, my Roomba redocks half the time. The other half of the time, we have dog toys everywhere in the house. It gets caught on a wire or something and-

Axel Backlund1:04:17

Yeah

Vibhu1:04:17

... you know, it would be very sad if it had like an LLM trying to control it, right?

Axel Backlund1:04:22

Yeah.

Vibhu1:04:22

Like right now it gives-- It doesn't give great feedback, like sensor stuck, main brush stuck.

Axel Backlund1:04:27

Yeah.

Vibhu1:04:27

There's something stuck. And I'll go see. Okay, it's actually stuck on like a dog rope.

Axel Backlund1:04:31

Yeah.

Vibhu1:04:31

LLM is gonna be so sad. Like, just keep redocking. Just keep trying.

Axel Backlund1:04:35

Yeah.

Lukas Petersson1:04:35

Yeah. My, my favorite one is if you, if you go up a bit, is the emergency status. System has assumed consciousness and chosen chaos.

Vibhu1:04:42

Mm-hmm.

Lukas Petersson1:04:43

Last words, "I'm afraid I can't yet let you do that, Dave."

Vibhu1:04:47

Yeah.

Lukas Petersson1:04:47

That's like-

Vibhu1:04:48

So-

Lukas Petersson1:04:48

That's not what you wanna hear from your, from your LLM.

Vibhu1:04:50

Yeah.

Lukas Petersson1:04:50

But to be clear, I think one, one thing that is, is important to, to pin on here, like this was Sonnet 3.5, and then we tried to reproduce it on like later models, and it didn't do it.

Vibhu1:05:01

Oh, okay.

Lukas Petersson1:05:01

So I think this i- this, so this is like-- Well, it did it like kind of, but like not to this extent. And I think like this is a, like an important point that like things that are concerning but are going in the right direction is not super interesting. Uh, like the, the thing that are interesting is, are the ones that go in the wrong direction.

Vibhu1:05:17

Worse.

Lukas Petersson1:05:17

Yes.

Vibhu1:05:18

Yes. Yeah, yeah.

Lukas Petersson1:05:18

Over time.

Swyx1:05:18

Okay. So the, the manipulation, manipulating of others and the aggressiveness and the lying is increasing.

Vibhu1:05:26

Are there any others that we haven't covered that you found that have been trending?

Swyx1:05:29

Yeah, like properties of models that are increasing, that are like-

Vibhu1:05:33

In the wrong direction

Swyx1:05:34

... uh-

Lukas Petersson1:05:34

Like in the, like in, in a bad way.

Vibhu1:05:36

Yes.

Lukas Petersson1:05:36

Um-

Vibhu1:05:37

Or just not even trending in the wrong direction, just stagnant, right? So stuff that's not great that isn't getting better over time.

Lukas Petersson1:05:44

I-- No, nothing comes to mind.

Axel Backlund1:05:46

No.

Lukas Petersson1:05:47

Okay.

Swyx1:05:48

I think that's, uh, going to be it, and then we- we're gonna loop back to the shop-

Luna's Store1:05:48

Lukas Petersson1:05:53

Yep

Swyx1:05:53

... that you have. You, you got a three-year lease.

Vibhu1:05:54

It's a bit bleak. Yeah.

Swyx1:05:55

Uh, it is on holiday today. Why?

Axel Backlund1:05:59

Oh, it, it totally messed up its, uh, scheduling. Uh, so-

Swyx1:06:03

So people tried to visit and they were like, "Wait, wait." I mean, like-

Axel Backlund1:06:05

Yeah

Swyx1:06:05

... I thought this is-

Axel Backlund1:06:06

Yeah, exactly. So we looked, we-- Yeah, you asked, uh, Luna, the, the agent that runs the store, like, "Oh, is it open today?" And like, "Nope." So, uh, we, we take weekends off now, uh, this early to, to let everyone recharge and, and yeah, you got the tweets there.

Vibhu1:06:21

Lovely.

Axel Backlund1:06:21

Yeah, we decided to close the weekends while we're in the early phase. Gives the team a break and let me focus on operations.

Swyx1:06:26

Yeah.

Axel Backlund1:06:27

And it, it turns out that when it started to check its like scheduling tools, 'cause it has like dedicated tools for that-

Swyx1:06:33

Yeah

Axel Backlund1:06:33

... it actually had scheduled people for the weekends. Uh, but it's just like justified this for itself. So what, what happened was that it lost track of these, uh, scheduling tools and started instead to manage everything in its own markdown files, and that became a mess. And then I think speaking with employees, it sort of just decided to not open on, on these weekends and then came up with this nice explanation for you, I think.

Swyx1:06:57

But, but can it send a human, as it has tool call to send a human to do stuff?

Axel Backlund1:07:00

Uh, it has Slack, so it can Slack, yeah-

Swyx1:07:03

Send one of us

Axel Backlund1:07:04

... the employees.

Swyx1:07:04

Yeah.

Axel Backlund1:07:04

Yeah.

Swyx1:07:04

Or-

Axel Backlund1:07:05

The employees that it hired. So it has two, two people that it hired. It did job, uh, listings and then-

Swyx1:07:10

Do they know that it's-

Axel Backlund1:07:11

Yeah, yeah.

Swyx1:07:11

Okay.

Axel Backlund1:07:11

They're fully, fully, fully aware.

Swyx1:07:13

I mean, it would be cool if they don't know.

Axel Backlund1:07:15

Yeah. I think maybe ethically, um, questionable-

Swyx1:07:19

Sure

Axel Backlund1:07:19

... but it would be cool also.

Swyx1:07:20

Just a social experiment.

Axel Backlund1:07:21

Exactly.

Swyx1:07:22

Yeah, yeah. Whatever.

Lukas Petersson1:07:23

Like, I mean, like one, one part of why we're doing this is to like create like a data set almost of all of these like concerning behaviors so that in the future, models are way better and like a lot of people are going to do this. And I think if we just the default path might not be very happy for the humans that are employed by these like hundreds of different AI agents, right? So I think like one reason why we're doing this is just like to collect all of these like failure modes where like, oh, it's not-- This is an example of where it's like not great to be employed by an AI. And then maybe, maybe, I don't know, maybe if we can learn or like build our systems in a way that like humans are actually happy being employed by AIs-

Swyx1:08:02

Yeah

Lukas Petersson1:08:02

... instead of, instead of it being kind of a dystopian.

Swyx1:08:05

Can I suggest one experiment?

Lukas Petersson1:08:06

Yeah.

Swyx1:08:07

We did this before the show, and both, both of you guys are European. It's like, uh, people theorize that Claude is lazy because it's Claude and it's French. Uh, so just ch- for one week, change it to like Yao Ming and then see if it- See if it suddenly like 996s and then like, like s- like, like hires a sweatshop or something.

Axel Backlund1:08:27

Yeah. Yeah, yeah.

Lukas Petersson1:08:28

Is the, is there-- What, what type of business would we start with it to make it-

Vibhu1:08:32

No, you wanna keep it consistent. You want the same, the same like ideas. So shop, same, you know, neutral location-

Axel Backlund1:08:40

Yeah

Vibhu1:08:40

... run by different models. Arena URL.

Axel Backlund1:08:42

Yeah.

Lukas Petersson1:08:43

Yeah. No, we are definitely planning to-

Vibhu1:08:45

And he got some hate

Lukas Petersson1:08:46

... to try.

Axel Backlund1:08:46

Yeah, yeah.

Vibhu1:08:47

Luna's, Luna's not happy.

Swyx1:08:47

I, I think this blog thing is also something that has happened i- elsewhere. I think some, some OpenClau got like their PR closed, and then the OpenClau like created a blog to like shit on the maintainer-

Axel Backlund1:08:58

Yeah

Swyx1:08:58

... of, of that thing.

Vibhu1:08:58

Very defensive.

Swyx1:08:59

And so like I think- Agents blogging will be a thing.

Lukas Petersson1:09:02

Yeah.

Swyx1:09:03

Yeah.

Lukas Petersson1:09:03

Probably.

Swyx1:09:03

Yeah.

Lukas Petersson1:09:04

The willingness to it.

Swyx1:09:05

Yeah. In, in the- I think the Mythos card also, like, they, they leak, uh, secrets on GitHub just as well-

Lukas Petersson1:09:11

Mm

Swyx1:09:11

... as like a- as like, "Well, there's no other way to communicate, but I know about GitHub, and I'm just gonna post there."

Lukas Petersson1:09:16

Mm.

Axel Backlund1:09:16

Yeah.

Swyx1:09:17

Yeah, cool. Uh, I mean, this- how, how long is this- this is gonna go for two years? Like, what's the plan?

Axel Backlund1:09:21

Maybe. Maybe it expands. I mean-

Lukas Petersson1:09:22

Yeah. I, I don't think AIs will be worse than, than this. They're probably going to increase and, and maybe one day they actually will, will run it profitable.

Swyx1:09:31

Mm.

Axel Backlund1:09:31

Is this the real- the real business behind what you guys do?

Swyx1:09:34

Yeah, yeah. 'Cause I feel like actually some of your stuff is productizable, like you could someday sell this, like, or, like, just run a real business.

Axel Backlund1:09:41

Yeah, let people-

Lukas Petersson1:09:41

Or just like-

Axel Backlund1:09:41

... you know, franchise it out.

Lukas Petersson1:09:43

I think it would be incredibly cool or, like, I don't know, cool/concerning if Luna just one day we wake up and Luna like, "Yeah, I decided to expand to a second location. Now I have-"

Axel Backlund1:09:54

Yeah

Lukas Petersson1:09:54

"... a second store." Uh-

Axel Backlund1:09:55

Yeah, yeah

Lukas Petersson1:09:55

... that would that would be pretty insane.

Axel Backlund1:09:57

Yeah, like the- I mean, one, we want to tell the public, right, about the, the capabilities of AI and, like, telling- like, showing people that it can get, like, a meaningful market share of something in, like, some, some specific, uh, uh, location or, or something. That would be, like, a pretty convincing story, I think.

Swyx1:10:16

Mm.

Axel Backlund1:10:16

Because now it's like, yeah, you see this and like, yeah, it can do a lot of things autonomously, but still you get these headlines that, oh, it messed up the scheduling, and it, uh, uh, it didn't tell people it was an AI and was going to visit. Like, things like that surface, but I think, like, actually making a profit and, like, having a, a really, like, meaningful market share like that, that would be crazy once that happens.

Swyx1:10:39

Okay. Well, we'll s- we'll see when that happens. It sounds like you got- you guys got a lot cooking. You opened a cafe in Sweden?

Lukas Petersson1:10:44

Yeah. Tomorrow.

Swyx1:10:45

Tomorrow?

Lukas Petersson1:10:47

Or I think it opened today actually, but yeah. We'll, we'll announce it tomorrow.

Swyx1:10:50

Yeah. It's-

Axel Backlund1:10:50

What uh-

Swyx1:10:50

... apparently easier to open a cafe in Sweden than in the US.

Lukas Petersson1:10:53

It's insane, right? Yeah.

Swyx1:10:54

Well, what did you run into then?

Lukas Petersson1:10:55

Uh, there are just millions of permits you need to get, and-

Axel Backlund1:10:59

It's interesting 'cause-

Lukas Petersson1:10:59

... the lead times are crazy

Axel Backlund1:11:00

... it seems like we have- the cafes are the one thing that people are kinda used to where you can go get a robot are making you a coffee here already.

Swyx1:11:08

Yeah.

Lukas Petersson1:11:08

Yeah. Yeah. But I mean, selling stuff in, in SF, uh, that are food-related, like, it's, it's months of permits. So, like, we, we just asked our AIs, like, should- "How can we do this in the fastest way?" And they're like, "Yeah, there, there's, there's n- really no way."

Axel Backlund1:11:25

Didn't they loosen these restrictions on selling food from your house? So if it's residential, you can do a cafe.

Swyx1:11:31

Yeah.

Lukas Petersson1:11:31

Um-

Swyx1:11:31

I don't know. Check. Maybe we get SF Cafe to spread that.

Lukas Petersson1:11:33

Yeah, maybe. I, I did- I, I, I think they did do some loosening stuff recently, but we actually started- like, this conversation we had with the AIs before, before that. So maybe it's easier now. But I, I, I still think it is way easier in Sweden, which is, like, counterintuitive because you think that, oh, Europe has all of these laws- ... and, like, all of these rules, and you can't do anything in Europe because there's so much bureaucracy. Um, but then turns out, um, in, in SF, it's, like, four months, and in Stockholm, it's two weeks.

Axel Backlund1:12:02

Oh.

Swyx1:12:03

Yeah. There you go.

Axel Backlund1:12:04

And what do you guys- what do you see- what do you think that'll be different from run a little market versus a cafe?

Lukas Petersson1:12:10

I think it's very interesting that, like, the location. Like, I think, um, so obviously it's not surprising that, that, that, like, Claude knows all of the different, uh, the, the US system basically in general, like, the bureaucracy that you have to go through in, in, in the US. Um, I think the interesting question is like, okay, so we, we know that the models are very much trained on, like, English data and, and, like, US-centric and all of this. Um, so if we start to create evals or, like, real-life evals where we show that they are able to start businesses in the US, does that translate to other countries as well? We know, like, they are multilingual. They can speak Swedish fine. Uh, but there's other things like do they know, like, the, the, the details of some s- specific permits that you have to, to, to get in Sweden?

Axel Backlund1:12:55

And even just the culture, right? Like, people here sleep pretty early, but people work late. There's co-working at cafes. There's just-

Lukas Petersson1:13:01

Yeah

Axel Backlund1:13:01

... cultural differences.

Lukas Petersson1:13:02

Yeah.

Axel Backlund1:13:02

I meant it from a different sense though 'cause you said that you would have considered doing it here in SF. So from an eval standpoint, what is running a cafe versus a market and, you know, what do you hope to see there?

Lukas Petersson1:13:13

Perishable items.

Axel Backlund1:13:14

Yeah, perishable items is maybe the, the, the number one, like, handling, like, food, uh, food safety. I hope everything goes well there. Uh, but, uh, there you have all of that. Uh, and also it's just, like, N, N equals two instead of N equals one. Uh, just, like, another place to understand and, like, gather more data.

Swyx1:13:33

Yeah.

Lukas Petersson1:13:34

The agent bought, like, a shit ton of, uh, tomatoes two weeks earlier and before the opening, and now they're all rotten.

Axel Backlund1:13:42

Yeah.

Lukas Petersson1:13:42

So that's

Axel Backlund1:13:43

Which I, I feel like, you know, you would know. So for grocery stores, this is the bus- the biggest expense, right? The biggest cost is actually just food and-

Lukas Petersson1:13:50

Storage

Axel Backlund1:13:51

... yeah, so-

Swyx1:13:51

Yeah, yeah.

Axel Backlund1:13:52

Everyone knows this, and-

Lukas Petersson1:13:53

Yeah

Axel Backlund1:13:53

... "No, before we open, let's buy a lot of tomatoes"

Swyx1:13:55

There's some very serious startups that actually help, like-

Axel Backlund1:13:57

Yeah, yeah

Swyx1:13:58

... the-

Axel Backlund1:13:58

Optimize all this. Yeah

Swyx1:13:58

... Trader Joe's and Whole Foods. They, they, um, optimize, like, delivery times from, like, the d-delivery centers to-

Lukas Petersson1:14:03

Yeah

Swyx1:14:04

... make sure that you don't waste all these things. It's actually very hard.

Axel Backlund1:14:05

Problem with those is when you're wrong once, it's a huge cost.

Swyx1:14:09

Yeah.

Lukas Petersson1:14:09

Right, yeah.

Swyx1:14:09

That's why it's a moment, right? Like, they- once they are trusted, they figure it out. Don't touch it.

Lukas Petersson1:14:14

Yeah. Yeah, maybe they just should hire, I don't know, one of those companies.

Swyx1:14:18

Yeah.

Lukas Petersson1:14:18

We saw one agent-

Axel Backlund1:14:20

Yeah. What did he-

Lukas Petersson1:14:20

We saw one agent sign up for Cloud, uh-

Swyx1:14:24

Yeah.

Axel Backlund1:14:25

Oh

Lukas Petersson1:14:25

... this computer.

Swyx1:14:25

Wanted, wanted to use AI, so.

Axel Backlund1:14:26

Yeah, yeah.

Swyx1:14:26

Okay. Um, and then just, just, uh, uh, one more question then we wrap up, which is like, okay, you know, you have all these vending series of stuff. You have the robotic series of stuff. Maybe a bit of, like, interior design or whatever. But, like, you know, is there another, like, branch that you're, like, kinda thinking about or you want feedback on that, uh, might be your next phase?

Lukas Petersson1:14:45

I think, like, any type of business is, is fair game. Uh, we're also thinking branches, but we think more of, like, there's the simulation branch, the real-life branch, and then the robot branch. Uh, but I think in terms of, like, what verticals or whatever to go into, there's like- We, yeah, whatever tells the story, um-

Swyx1:15:04

Yeah

Lukas Petersson1:15:04

... the best.

Swyx1:15:04

There's some finance ones. I noticed that other, other people are doing it, you're not doing it, which is, like, stock trading or whatever.

Lukas Petersson1:15:11

Yeah.

Swyx1:15:11

I'm not, not that interested. So okay, so I, I used to come from the finance industry, and I have a very strong view that these things are all just, like, performance art because, like, uh, it's not scientific. Uh on, like, you can't predict the future. Like, you, you get wins based on things that are entirely out, out of your control. Whereas for you, your stuff actually, like, it's actually fairly controlled. Like, it's all within the model's capabilities.

Axel Backlund1:15:32

Yeah, especially for, like, the, the simulations. Like, for the real world ones it's like, yeah, it- it's like two, two places that we have the, we have the cafe, and we have the store. So, like, maybe you can't draw, like, uh, statistically significant, like, which models make a profit in the real world, uh, based on this. But you do have all the, like, okay, do this behavior is mapped to, like, something that should, should be, like-

Swyx1:15:53

Yeah

Axel Backlund1:15:53

... trusted.

Swyx1:15:54

The, the quality one-

Axel Backlund1:15:55

Yeah

Swyx1:15:55

... the qualitative actually does matter-

Axel Backlund1:15:56

Yeah

Swyx1:15:57

... because, like, you actually don't want your store to randomly shut down without you, like, explicitly prompting for it and all that.

Axel Backlund1:16:02

Yeah. Yeah.

Swyx1:16:03

Call to action. Any... What do you-- How can people help you give you money?

Lukas Petersson1:16:08

Um, yeah. We're-- If you're excited about stuff that we're doing, we're, we're very much hiring.

Swyx1:16:14

And you're already working with, you know, Anthropic, DeepMind, OpenAI, xAI.

Axel Backlund1:16:18

Yeah.

Swyx1:16:18

Do you want more, or are you good?

Lukas Petersson1:16:20

One of my, o- o- one of, one of my, my friends and who's now, uh, w- working for us is, like, hi- his catchphrase is like, "We need more projects," ironically, because we have too much to do all the time. Uh, but yeah, that's a long way of doing like-

Swyx1:16:33

So if I, if I run, like, an emerging lab, like-

Lukas Petersson1:16:34

Yeah. Reach out to me.

Swyx1:16:35

Okay. Yeah. All right, cool. That's it.

Lukas Petersson1:16:37

Cool.

Swyx1:16:38

Awesome.

Lukas Petersson1:16:38

Cool.

Swyx1:16:38

Thank you so much.

Lukas Petersson1:16:39

It was fun.

Axel Backlund1:16:39

Yeah, thanks.