⚡️Making DeepSeek v4 outperform Opus 4.7 with Taste — @Ahmad Awais , CommandCode.ai — Latent Space

Intro 0:00

Swyx 0:04

Okay. We are here in the remote studio with Ahmad Awais. How are you?

Ahmad Awais 0:08

I'm doing great. Thanks a lot for having me.

Swyx 0:11

Yeah. You and I have known each other since before AI. You were very active in the WordPress community. I don't know how I first came across you beyond that. I think just general web stuff.

Ahmad Awais 0:22

Maybe DevRel, maybe DevRel. I think it was before you had joined Netlify.

Swyx 0:27

Yeah, but were you ever a professional DevRel?

Ahmad Awais 0:30

Yeah, I was VP of DevRel at RapidAPI for a while.

Swyx 0:35

Oh, that's right. That's right. Yes, yes, yes, yes.

Ahmad Awais 0:38

That is as professional as one gets, right?

Swyx 0:40

No, no, no, 'cause I always see you as, like, an independent creator type person.

Ahmad Awais 0:44

No, I've worked at a bunch of different places, with Google, with Airbnb, but mostly I've been this open source guy who has published three hundred plus open source repositories.

Swyx 0:55

Yeah.

Ahmad Awais 0:56

Everything is semantics after that. My open source work took me places and, like you, I create a lot of content, so DevRel.

Swyx 1:06

Yeah. Okay. So tell us about the path into CommandCode, and then we're gonna highlight some of the work you did recently.

Background 1:12

Ahmad Awais 1:12

I think the story kind of starts at COVID. I basically did a Corona CLI thing that went viral. COVID was at its peak, and I was traveling a lot, being in DevRel and whatnot. And I think Greg Brockman and Sam Altman ended up giving me early access to GPT-3. So since then, I've been an AI engineer, right? The first thing I did... in fact, I was just looking at it, it was July 2020, and Greg sent me a message like: "What is the use case? What are you going to use this API for?" And I told him I was going to suggest the next line of code, like a code snippet, right? This was more than a year before GitHub Copilot was a thing.

Swyx 1:52

Mm.

Ahmad Awais 1:53

So I started building this thing called CLAI. I've always been a big fan of building CLIs, so it was a silly side project, CLAI, right? And I think that eventually became CommandCode. The short history is that we ended up building an AI cloud (now everything is an AI cloud) called LangBase. It grew quite big, one point two billion agent runs a month, and we ended up building a memory infrastructure and whatnot. Now we've pivoted around this feeling that there is only one type of agent, and that is a coding agent. It can do it all, right? So why hide that capability behind some memory system or some primitive or whatnot? That six-year-old codebase has eventually turned into this thing we call CommandCode. CommandCode actually started because I personally was using it a lot more than other coding agents. Then a couple of team members started adopting it, and we ended up building this meta neurosymbolic model called Taste One. Here's what it does. I have a lot of experience with code; I've been writing code for twenty-seven years or something, right? After publishing three hundred plus open source projects, you get to have a lot of opinions on things. And I mostly find myself working on things that are super cutting-edge, so there are no docs that an AI agent can go read.

Taste Framework 2:51

Ahmad Awais 3:09

So in that world, I feel like my opinions matter more than what an LLM can actually find, or what you can do with RAG or whatever. So I ended up encoding this behavior in a meta neurosymbolic architecture: if you learn something from me, document it for me like a skill, right? And we started calling it Taste. Say you see me prefer PNPM a lot, but I link my local CLI globally with npm link. It would learn that I prefer PNPM for installing packages in almost everything, but when I'm linking my CLI locally, I'm using NPM. These types of learnings eventually became taste files, which are very similar to skill files. CommandCode automatically learns from you on a per-repository basis, so from your team, right? And it builds a library of skills that is much less verbose. Not everything goes in there, only the things it sees as repeated preferences and patterns across your work. Those could be coming from many different coding agents. When you merge something into main, that is when we can trigger learning from your accepts, edits, and rejects overall. But that's not what we are here to talk about.

You know how the silly thing about anything in tech is that the thing you think will go viral or get adopted never does, and the thing you ignore off the cuff, like "who cares about open models," becomes a thing, right? It is the twenty-fifth of May today, and literally about twenty-five days ago, everybody was talking about how DeepSeek is so good, and a lot of other people were asking how DeepSeek could be so bad and why people were saying it was so good. So I was in the camp of, "Well, I need to decide whether DeepSeek v4 Pro is actually as good as Opus," right? At that time we were doing a couple billion tokens a day; we're about a hundred times that now. I ended up discovering a thing that I call tool confusion, which is very odd. We have been able to figure out how to deterministically fix tool calling for open models. It's been such an amazing thing that a lot of people have been trying it out through CommandCode, and I also made a bunch of it completely open. You can go and implement this in any coding harness because, you know, me being me, this is not that important, right?

Tool Calling Repair 4:48

Ahmad Awais 5:47

So yeah, let me know how deep you want me to go into that topic. I'm very excited to talk about it.

Swyx 5:54

Yeah. I think we can dive right into it. I think you have a viral post that you want to share. And also, frame the problem for people who are maybe only used to OpenAI, right? They don't know what a tool-calling problem is.

Ahmad Awais 6:08

Think of it like this. Our whole vibe is command code, right? You should be able to command your LLMs to do whatever you want them to do. That is the vibe we are going for. The first thing was to build taste so we can steer models in the right direction. And then the next thing happened: programmers stopped writing code. So who are we going to learn taste from? We started learning taste from models, because models are writing code. If you spend a billion tokens on Claude, on GPT, on DeepSeek, you discover a lot of discrepancies across the stack and across different full-stack systems, like DeepSeek is not that good at code code for some reason. I started calling all of those variations, and we discovered a deterministic pattern across them. When you're running a coding agent, a very basic tool call is something like running a bash or shell tool to discover something: listing the directories, or reading through a bunch of different files. Say the user has asked, "How is authentication built in this repository?" Your coding agent is now trying to figure out, with a bunch of pre-cooked tools, that it needs to list all the files, read a bunch of them, explore them, and then answer the user, right? What a lot of open models are doing is, I think, they somehow suck at tool calling. And the pattern is not super broad; it is very specific. With DeepSeek, DeepSeek v4 Pro has this weird alpha male energy, where whatever it sends you, it thinks is the right thing to do. If it sends you the wrong schema for a tool call and you send back a Zod error, it doesn't listen to you. It would repeat that same call something like fifty-six times on average in a billion tokens-

Swyx 8:04

Yeah

Ahmad Awais 8:04

... where you'll be like, "Why are you... Why is it doing that?" Right?

Swyx 8:07

Why don't-

Ahmad Awais 8:07

So-

Swyx 8:08

Why don't they listen to the errors? I thought that's a very common thing. Like, Instructor used to do that.

Ahmad Awais 8:12

I have no idea. As a programmer, I would think that if my tool call is failing, all I have to do is send some schema back, like a Zod error-

Swyx 8:22

Yeah

Ahmad Awais 8:22

... to an LLM that is smart enough, right? And it will just do it. My gut feeling is it might be because a lot of these open models are built or trained on data they consider really high quality, like outputs from a better model than them or something like that. And their training is: whatever you are being told is right, it's correct. So their entire nature is, "Whatever I am telling you is also correct, so don't try to correct me." Again, quite a lot of this is vibe-based. And I discovered this because of a silly thing. For example, I shared it in my post as well. Do you mind if I share it, like share the screen or something?
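To make that failure mode concrete, here is a minimal Python sketch (Python stands in for the Zod-based validation mentioned above). The `shell` tool, its `timeout` parameter, `validate_shell_args`, and `RepeatDetector` are hypothetical names, not CommandCode code: a strict validator rejects an optional argument sent as null, the error goes back to the model, and a small counter notices when the model re-sends the identical failing call.

```python
import json


def validate_shell_args(args: dict) -> list[str]:
    """Strict validation in the spirit of a Zod schema (hypothetical `shell` tool).

    `command` is required; `timeout` is optional but must be an int when present,
    so {"timeout": None} is rejected rather than treated as "not provided".
    """
    errors = []
    if not isinstance(args.get("command"), str):
        errors.append("command: expected string")
    if "timeout" in args and not isinstance(args["timeout"], int):
        errors.append("timeout: expected integer when provided")
    return errors


class RepeatDetector:
    """Counts consecutive identical failing tool calls (hypothetical helper)."""

    def __init__(self, limit: int = 3):
        self.limit = limit
        self.last_key = None
        self.streak = 0

    def record_failure(self, tool: str, args: dict) -> bool:
        """Return True once the same failing call has been seen `limit` times in a row."""
        key = tool + ":" + json.dumps(args, sort_keys=True)  # canonical form of the call
        if key == self.last_key:
            self.streak += 1
        else:
            self.last_key, self.streak = key, 1
        return self.streak >= self.limit


# Simulate a model that ignores the error message and re-sends the same call.
detector = RepeatDetector(limit=3)
call = {"command": "ls -la", "timeout": None}
for attempt in range(1, 6):
    errors = validate_shell_args(call)
    if errors and detector.record_failure("shell", call):
        print(f"attempt {attempt}: same invalid call repeated; error feedback is not working")
        break
```

Detecting the loop only tells the harness that error feedback is not landing; it does not fix anything by itself.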

Swyx 9:05

No, go ahead. I think-

Ahmad Awais 9:06

Okay

Swyx 9:06

... you have the ability. We dry-tested this before we recorded.

Ahmad Awais 9:09

Yep. If we can go back.

Swyx 9:10

It's easier for people to look at. They can look up the-

Ahmad Awais 9:13

Yes

Swyx 9:13

... tweet while we're talking as well. Just make sure you zoom in, obviously.

Ahmad Awais 9:18

Yes. So this was the post, right? This was me saying I just discovered why and how DeepSeek can outperform Opus 4.7. This is Opus 4.7, not 4.6; 4.6 is a better model. What we discovered was that wherever you have some tool, like a shell tool, with parameters or arguments that should be optional, it would send something weird in them, like an empty object or a null where it doesn't belong. Zod, being super strict, would trip up and send back that error, and we would get the same tool call back again and again. So instead of sending back that error, I ended up repairing the call. It started with about thirty-two hundred lines for four repairs. Think of this repair logic like database migrations, where you have one migration per file. So I ended up creating repair files. If you see the model emitting a JSON string when I actually wanted an array, I can deterministically fix that into an array. And when I do that, I don't just send back the result; I also send back a note, a repair hint: "You should have sent me this type of data, but here is the result anyway."

Think of it like teaching somebody to drive when they are about to hit another car. Instead of telling them what to do correctly, you first save them, and then you explain why you saved them and what they should have done in the first place. I think a lot of these models really like this repair logic, because what we saw is that the moment you send the result with the repair hint, right after that, the third tool call is fixed. All of a sudden it becomes super smart. It understands, "Okay, I got the result I was looking for, and I'm gonna do this." And it shows up in so many different places. For example, it tries to read a file without giving you an offset: is it trying to read a hundred lines from the top or from the bottom of the file? So I just make a judgment call: it's the first time it's reading that file, so let's give it the first hundred lines. Then it realizes very quickly, "Oh, I was actually trying to read a log file and needed the last hundred," and it can easily adjust, instead of those fifty-plus tool call failures on average.

Something like this should be very obvious to developers, but what's happening is that a lot of developers actually use Claude Code: they change the base API endpoint and the API key, and they try to use Claude Code as the harness for open models. In Claude Code, a lot of the errors are hidden behind Control-O, so you don't even know that you have fifty-plus tool call failures per session. You're just sitting there like, "Oh, why is DeepSeek so slow?" And it's in their vested interest not to fix that. They don't care about open models; they didn't build their coding agent for open models, so it works out really well for them. But the common theme across Twitter is, "Oh, it's so good," and, "Oh, it's so bad, it's super slow." So I feel like this usually ends up being a tool-call and harness issue rather than an actual model issue. It can be as silly as this: when it sends the read-file path, it wraps it in a markdown link for no reason at all. That is super deterministically fixable, so you don't have to waste your tokens on it. Um-
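The repair approach described here can be sketched in a few dozen lines. This is a minimal Python illustration, not CommandCode's repair files: an ordered list of small repair functions is applied like migrations, each fixing one deterministic mistake and contributing a hint that is returned alongside the real tool result. The tool names (`read_file`, `read_many`), the parameter tables, the hint wording, and the 100-line default window are all assumptions chosen to mirror the examples in the conversation.

```python
import json
import re
from typing import Optional

# Hypothetical schema facts the repairs rely on.
OPTIONAL_PARAMS = {"shell": {"timeout", "cwd"}, "read_file": {"offset", "limit"}}
ARRAY_PARAMS = {"read_many": {"paths"}}
MD_LINK = re.compile(r"^\[[^\]]*\]\(([^)]+)\)$")  # "[label](target)"


def drop_empty_optionals(tool: str, args: dict) -> Optional[tuple]:
    """Optional params sent as null or {} are dropped instead of failing validation."""
    bad = sorted(k for k, v in args.items()
                 if k in OPTIONAL_PARAMS.get(tool, set()) and (v is None or v == {}))
    if not bad:
        return None
    fixed = {k: v for k, v in args.items() if k not in bad}
    return fixed, f"Omit optional {', '.join(bad)} instead of sending null or {{}}."


def parse_stringified_arrays(tool: str, args: dict) -> Optional[tuple]:
    """Array params emitted as a JSON string, e.g. '["a.ts"]', are parsed into a list."""
    fixed, changed = dict(args), []
    for key in sorted(ARRAY_PARAMS.get(tool, set())):
        value = fixed.get(key)
        if not isinstance(value, str):
            continue
        try:
            parsed = json.loads(value)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, list):
            fixed[key] = parsed
            changed.append(key)
    if not changed:
        return None
    return fixed, f"Send {', '.join(changed)} as a JSON array, not a string."


def strip_markdown_link(tool: str, args: dict) -> Optional[tuple]:
    """A path sent as '[label](src/app.ts)' is reduced to the link target."""
    path = args.get("path")
    match = MD_LINK.match(path.strip()) if isinstance(path, str) else None
    if not match:
        return None
    return {**args, "path": match.group(1)}, "Send a plain file path, not a markdown link."


def default_read_window(tool: str, args: dict) -> Optional[tuple]:
    """read_file with no offset/limit: make a judgment call and return the first 100 lines."""
    if tool != "read_file" or "offset" in args or "limit" in args:
        return None
    return ({**args, "offset": 0, "limit": 100},
            "No offset given, so lines 1-100 were returned; pass offset/limit to read elsewhere.")


# Applied in order, like migrations; each entry fixes one known mistake.
REPAIRS = [drop_empty_optionals, parse_stringified_arrays, strip_markdown_link, default_read_window]


def repair_call(tool: str, args: dict) -> tuple:
    """Run every repair in order; return the repaired args and the hints collected."""
    hints = []
    for repair in REPAIRS:
        result = repair(tool, args)
        if result is not None:
            args, hint = result
            hints.append(hint)
    return args, hints


def with_repair_note(tool_output: str, hints: list) -> str:
    """Send the real result anyway, followed by a hint about what should have been sent."""
    if not hints:
        return tool_output
    return tool_output + "\n\n[repair note] Your call was auto-corrected. " + " ".join(hints)


args, hints = repair_call("read_file", {"path": "[app.ts](src/app.ts)", "offset": None})
print(args)  # {'path': 'src/app.ts', 'offset': 0, 'limit': 100}
```

Order matters in this sketch: the null `offset` has to be dropped before the default read window can apply, which is one reason to keep repairs in an explicit, migration-style sequence.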

Coding Harness Issues 12:04

Swyx 13:25

Yeah, I think that's-

Ahmad Awais 13:25

And... Go ahead.

Swyx 13:29

Well, I think reading the data and actually looking at what you actually send is really helpful. I also wonder if it's only DeepSeek that's showing this kind of stuff, or is it a general open-models thing?

Ahmad Awais 13:44

Yeah. Again, me being me, I first thought this was just DeepSeek. Then I looked at our logs for the last thirty days, and Kimi was doing exactly the same thing. Then we fixed the Kimi models, then we fixed the MiniMax models, and now we have something like sixteen thousand different repair variations across hundreds of billions of tokens. We're doing upwards of six hundred billion tokens right now, and the data on failed tool calls is super, super important. Overall, this takes a model that was practically not useful, like DeepSeek V4 Flash, to something that can actually compete with this. This is more of a vibe check, but when I pushed out this update, one of our investors, the GP at Tom Preston-Werner's fund, PW, was like, "What did you do? Why is DeepSeek V4 Flash super solid now?" The vibe of the model completely changes. It starts doing things in a different way. I don't know if you've seen this, but if you run any coding agent with permissions on, the models are actually dumber, and if you run them with a complete bypass of permissions, they do much better. Even if you sit through all those "yes, yes, accept" prompts, you will see the model end up getting steered in the wrong direction because of the slowness of permission blocks. Maybe that's not how the models are trained, but it's the same thing with tool errors. If models see a lot fewer tool call errors, they are much more creative. They can explore a lot, and they can continue a lot longer. One of our users has actually done something like seventy billion tokens on DeepSeek. I was looking at his data; he broke our usage page, which is how we discovered him. And he says, "I constantly run DeepSeek with CommandCode for twelve-plus-hour sessions." I personally have not done that. So when a lot less tool confusion is happening, this repair logic kind of blows you away with how good open models can be overall.
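Deciding which repair to write next is mostly counting over failure logs. A minimal Python sketch, assuming each logged failure has already been reduced to a (model, tool, signature) record; the `ToolFailure` type, the signature format, and the model labels are hypothetical. It ranks the most frequent variations and lists signatures that recur across models, the way the DeepSeek pattern turned out to apply to Kimi as well.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolFailure:
    model: str       # label only, e.g. "deepseek-v4-pro"
    tool: str        # e.g. "shell"
    signature: str   # normalized description of what went wrong


def signature_of(param: str, value, expected: str) -> str:
    """Normalize a validation failure so identical mistakes collapse to one key."""
    return f"{param}: got {type(value).__name__}, expected {expected}"


def top_variations(failures, n: int = 5):
    """Most frequent (model, tool, signature) triples: candidates for new repair files."""
    return Counter((f.model, f.tool, f.signature) for f in failures).most_common(n)


def shared_across_models(failures) -> dict:
    """(tool, signature) pairs produced by more than one model, mapped to those models."""
    models = defaultdict(set)
    for f in failures:
        models[(f.tool, f.signature)].add(f.model)
    return {key: ms for key, ms in models.items() if len(ms) > 1}


null_timeout = signature_of("timeout", None, "int")
logs = ([ToolFailure("deepseek-v4-pro", "shell", null_timeout)] * 3
        + [ToolFailure("kimi", "shell", null_timeout)] * 2
        + [ToolFailure("kimi", "read_many", signature_of("paths", "[...]", "array"))])

for (model, tool, sig), count in top_variations(logs, n=2):
    print(count, model, tool, sig)
print(shared_across_models(logs))
```

A signature shared across models is a good candidate for a single repair that helps every open model at once.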

Swyx 16:03

Yeah. That's really fascinating. I guess, what else have you found? Is there ongoing work? It's been a little bit since you did this, and you've generalized a bit. Is there ongoing work in this area, or are you basically constantly spinning out insights like this from CommandCode?

Go Plan 16:23

Ahmad Awais 16:24

Yeah. So one of the first things that happened was, I think we might be the most used coding agent for DeepSeek right now. We are doing so much inference on DeepSeek that a lot of people tagged folks on the DeepSeek research team. And to prove the point, we launched a Go plan at just one dollar per month, where you can do something like six hundred million tokens of DeepSeek V4 Pro, just to show that open models are actually really, really good and they are catching up. I think that kind of percolated; I don't think it would be too far to say that DeepSeek saw they could discount their prices and do the same, and show people that their models are actually really good. I also saw a lot of people from different coding-agent harness companies being tagged in this tweet, asking whether their tool does this. There's also Pi, where people can now just put this entire thing into a prompt and get Pi to fix itself. Things like that. But one of the most interesting things that recently happened is that we have been able to apply the same thing to design slop.

Design Slop Fix 17:35

Ahmad Awais 17:45

You know, that indigo-purple gradient thing that all LLMs do.

Swyx 17:50

I think Mario Zechner has a viral post about it right now.

Ahmad Awais 17:54

Uh-huh.

Swyx 17:54

Basically, if you... Are you familiar with Mario Zechner?

Ahmad Awais 17:57

Yeah, yeah.

Swyx 17:59

Yeah. If you go to his profile, he has an example of design slop.

Ahmad Awais 18:03

There you go, right. I think everybody just knows it. I love the purple color, and I would like to point out that it's indigo slop that is happening, not purple slop. What we have found is that we can apply the same model to fixing design slop across the hundreds of billions of tokens that we have done. We looked at the numbers, we chatted with a bunch of designers with amazing design taste, and we found that it is a very similar problem. There's a finite set of things that most LLMs do, and if you can give them a compositional framework of sorts, you can repair their design thinking; the same thing applies. For example, these are probably the 10 big rules. You can put them in a skill file or anything, and you will see your design get better. We've seen this repeated across different LLMs, commercial or open. And it's the same tool-repair logic of sorts: you are basically telling your coding agent, "You're not going to do this, this, this, and this," and if it does, you deterministically fix it. One of the things you might find really good here is the first one, work-pattern-first composition. When you ask a model, "Go and design me this dashboard," it generally does not think about the intention behind that design, and just slops you with three cards in a row and a border on the left side or top side or whatnot. A pretty common thing. If you give it a very simple framework of what type of surface we are looking for, which is literally just these seven patterns, it does really, really well. And things like-
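The "deterministically fix it" half can start as a plain lint over generated CSS. A minimal Python sketch that flags two of the smells named here: gradients built on indigo hues, and a single left or top accent border. The hue window (225 to 255 degrees), the regexes, and the rule names are assumptions, and "three cards in a row" is not detected. For reference, Tailwind CSS v3's indigo-500 (#6366f1) has an HSL hue near 239 degrees and purple-500 (#a855f7) near 271, which is the indigo-versus-purple distinction drawn above.

```python
import colorsys
import re

HEX6 = re.compile(r"#([0-9a-fA-F]{6})\b")
ACCENT_BORDER = re.compile(r"border-(?:left|top)\s*:\s*\d+px\s+solid")
INDIGO_HUES = (225.0, 255.0)  # assumed window, in degrees


def hue_degrees(hex6: str) -> float:
    """HSL hue of a 6-digit hex color (without '#'), in degrees [0, 360)."""
    r, g, b = (int(hex6[i:i + 2], 16) / 255 for i in (0, 2, 4))
    h, _lightness, _saturation = colorsys.rgb_to_hls(r, g, b)
    return h * 360


def design_smells(css: str) -> list[str]:
    """Return the smell names found in a CSS string (declarations split on ';')."""
    smells = []
    for decl in css.split(";"):
        if "gradient(" in decl:
            hues = [hue_degrees(h) for h in HEX6.findall(decl)]
            if any(INDIGO_HUES[0] <= h <= INDIGO_HUES[1] for h in hues):
                smells.append("indigo-gradient")
        if ACCENT_BORDER.search(decl):
            smells.append("single-edge-accent-border")
    return smells


slop = (".hero { background: linear-gradient(135deg, #6366f1, #a855f7); } "
        ".card { border-left: 4px solid #6366f1; }")
print(design_smells(slop))
```

A real harness would pair each detected smell with a rewrite (for example, swapping the palette), mirroring how tool-call repairs pair detection with a fix.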

Swyx 20:03

Where are these seven patterns from? Did you just come up with them?

Ahmad Awais 20:07

Yeah. I've talked to a bunch of different designers-

Swyx 20:10

Yeah. I was just wondering if there's a book or something. Is this lore that people just know?

Ahmad Awais 20:17

People... You could think of it like this: when a designer is designing a dashboard, they're thinking, "This is a monitor surface. We are trying to monitor." That is the intention behind it. Across the chats we had with a bunch of amazing designers, this is how they work, and we thought if we turned this into a /design skill, we could see how LLMs repair themselves. We also discovered something really interesting. I personally don't use OKLCH, but apparently LLMs are really good with it. If you see them using HSL, they aren't able to control lightness in HSL very well, even though to the human eye it's very easy to see that this color and that color do not look equally light. But if you force an LLM to use OKLCH, it can control the color palette really well compared to anything else. And this is what a lot of my designer friends do as well. They love using-

Swyx 21:21

Yeah, I think that's the reason we invented OKLCH, right? Every time color theory advances, we also have to change the CSS functions, which is mildly annoying for learning CSS.

Ahmad Awais 21:35

So yeah. I can't shake this feeling... I actually wrote about this just last night. The design slop section is somewhere in here; it's new, so I'm struggling to find it. We only have 24 reference documents, 10 design smells, and 7 patterns that we saw across different designers. When we designed a bunch of different landing pages and got those designers to look at them, it took them about 1.5 seconds to spot whether a page came from AI or whether a human had looked at it. When we wrote down the differences they spotted, they turned out to be deterministically fixable patterns. You can convert HSL into OKLCH, or force an LLM to do that. It feels like you can fix 90% of design slop, because it is not a capability gap. It's more of a contract gap between what your harness is telling an LLM to do and what your user is saying. The user is always going to say, "Fix my design," "Make it prettier," "Make it pop," or something like that. If you can give the model a framework for what a really good designer's taste is like, that they would pick this type of color scheme and think about intent before starting to implement a landing page ("What is the intent here?"), it's a short contract, but it makes design slop really minimal, I would add.
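The HSL-to-OKLCH conversion mentioned here is fully deterministic. A minimal Python sketch using Björn Ottosson's published OKLab matrices (sRGB, D65); the function names are our own. It also shows why HSL lightness misleads: pure yellow and pure blue are both 50% lightness in HSL, yet their perceptual lightness in OKLCH differs by roughly a factor of two.

```python
import colorsys
import math


def srgb_to_linear(c: float) -> float:
    """Undo the sRGB transfer curve (component in [0, 1])."""
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def linear_srgb_to_oklab(r: float, g: float, b: float) -> tuple:
    """Linear sRGB -> OKLab, using the matrices from Ottosson's OKLab definition."""
    l = 0.4122214708 * r + 0.5363325363 * g + 0.0514459929 * b
    m = 0.2119034982 * r + 0.6806995451 * g + 0.1073969566 * b
    s = 0.0883024619 * r + 0.2817188376 * g + 0.6299787005 * b
    # Signed cube root, so tiny negative rounding errors cannot produce complex numbers.
    l_, m_, s_ = (math.copysign(abs(x) ** (1 / 3), x) for x in (l, m, s))
    return (
        0.2104542553 * l_ + 0.7936177850 * m_ - 0.0040720468 * s_,
        1.9779984951 * l_ - 2.4285922050 * m_ + 0.4505937099 * s_,
        0.0259040371 * l_ + 0.7827717662 * m_ - 0.8086757660 * s_,
    )


def hsl_to_oklch(h_deg: float, s: float, l: float) -> tuple:
    """CSS hsl(h, s, l) with s and l in [0, 1] -> (L, C, H in degrees)."""
    r, g, b = colorsys.hls_to_rgb(h_deg / 360, l, s)  # note: colorsys takes h, l, s
    L, a, b_ = linear_srgb_to_oklab(*(srgb_to_linear(c) for c in (r, g, b)))
    return L, math.hypot(a, b_), math.degrees(math.atan2(b_, a)) % 360


def to_css(L: float, C: float, H: float) -> str:
    """Format as a CSS Color 4 oklch() value."""
    return f"oklch({L * 100:.1f}% {C:.3f} {H:.1f})"


for name, hsl in [("yellow", (60, 1.0, 0.5)), ("blue", (240, 1.0, 0.5))]:
    print(name, to_css(*hsl_to_oklch(*hsl)))
```

Both inputs are 50% lightness in HSL, yet their OKLCH lightness comes out near 0.97 for yellow and 0.45 for blue. Picking palette steps by OKLCH L keeps perceived lightness even across hues.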

Swyx 23:10

Yeah. I think this is a really good overview. And for those who don't know, Ahmad actually has a lot of authority on the design side too. I don't even know everything that you've done, but I definitely will listen when you have thoughts on the intersection of design and code.

Ahmad Awais 23:31

Thanks a lot. I wanted to be a designer. I was never really that good at it; I couldn't sketch or anything. But this is a feeling I have: a lot of people are now able to build just about anything, and we are now differentiating between good work and bad work based on design. So enabling every builder using a coding agent to design like a designer is something that is very close to my heart.

Swyx 24:02

Yeah, I think we're all becoming a lot more generalist, and there are definitely things that I would not have dared to do previously that I'll now try with AI.

Like, yeah, let's see. It looks like you wanted to show us CommandCode. Go ahead.

Ahmad Awais 24:20

Yeah. For example, this is the landing page of our documentation. As a developer, I would not have gone through the trouble of creating all of this. But CommandCode now comes bundled with a /design skill, which has all of these references pre-cooked in. Just last night I was fixing the deals section we have, like Qwen two point six being fifty percent off. I fed it a very basic screenshot of all of this mess, and this is what it converted it into. To me, it looks really, really good. I'm like, "Okay, this feels like a ticket you could print, like a cinema ticket or whatnot." It basically understood the intention behind this deal and tried to design for it, when I, as a programmer, only told it, "Here's a deal, here's the data, just add that to our docs."

Swyx 25:24

Yeah.

Ahmad Awais 25:24

So it's about giving that framework. This is not like tool calling or anything; it's a framework for thinking. LLMs have this capability cooked in. They can organize their thoughts really well, and they can design really well if you just give them the right way to think about those things. The same concept we applied to tool calling basically applies to design. And one of our team members is working right now with a community of security people who think the same approach can be applied to security as well. You look through the logs, figure out the most common patterns security-wise and what is brewing up, and then you can apply automatic fixes to your packages or whatnot, and make LLMs really coherent at following your guidelines to never write insecure code.

Swyx26:25

Yeah. I think, like, that's something that we all want to, like, get smart on. Can you give us like an overview of just like the differentiators that, that, uh, people should be aware of? Because I think, like, this is one of them that's like, that you've been working on. Um, you know, taste is, is something that you've, you're, you're definitely focusing on, on owning. Are you still pursuing your own models, or are you, are you mostly just gonna be like the best CLI for open models?

Taste vs Skills26:52

Ahmad Awais26:52

Yeah, I think there are like two directions right now that we, uh, sort of own. Uh, taste thing is, again, this is like a CLI, right? It, it is like full-fledged coding agent, does everything that you can expect any coding agent to do. It has both, uh, you know, commercial models or open models. It just that, that we have found our PMF of source in the open models market more than, you know, Claude is actually really, really lenient with tool calls. So even if, you know, your coding agent harness messes up, it can figure out that, "Oh, I'm, I'm being sent this error," and can fix itself. Not the case with, uh, you know, uh, open models. But the taste thing is, uh, still there. Like, uh, it basically sort of like works like this. This is a very common, uh, site that I have, and I'm using something, please use TypeScript. It's using TSE. I, I want you to use TS, uh, TSUp, right? It's using some different framework of testing. I prefer a test, and it's like a lot of, you know, back and forth-

Swyx27:53

Back and forth. Yeah

Ahmad Awais27:53

... in getting your coding agent to do what you wanted done in the first place. But with taste, taste is like a meta neurosymbolic model. Like, I have a lot of, uh, I try to like, based on the feedback, I've hidden it very well in doc somewhere. So, you know, developers don't have to go and read the-

Swyx28:12

My God. What the hell is that?

Ahmad Awais28:13

... you know, silly, silly things. Right? The entire KL divergence loop that we look at, like if an LLM already knows about something, it should not end up in your, you know, skill or taste file. That is absolutely useless context, right? So it basically does all of this weird thing, where once you actually go through... Where is that link? Once you actually just go through all of this, once you have built a CLI or an API or a front-end project with CommandCode, it actually ends up learning a bunch of those rules that are automatically managed for you. So, for example, if you are using Commander for building CLIs, and now in this particular project, you start using Meow, it will replace that for you. The entire idea is that your skills are being automatically learned and automatically managed, and they are absolutely transparent. They are in your repository, not in our model. So you're reviewing it in every PR. You're looking at like, "Yeah, I, I don't want to," you know, save something like this or whatnot, and they're never stale. A lot of issues that I feel a, a lot of people face are because their agents.md or claude.md has some wrong information when they sat down. Uh, this is like one thing I discussed, uh, when, uh, we launched, we announced our, you know, five million dollar seed, that a lot of people when they sit down, they think of the rules or

whatnot. They think in the terms of grandiose things, right? Rules are like, you know... Let me actually zoom in. So this is like-

Swyx29:48

Yeah

Ahmad Awais29:48

... the difference between a skill or rules file or a taste file, right? This is what you are writing down, and taste says what is continuously being learned from your prompts or your edits, and they are being stored in the same markdown file for you, right? And updates are when you remember to do it. And most of the time I've seen we, we, as humans, we sit down, and we do this grandiose thing. Like, you know, use this. I always prefer that or whatnot. And tastes are a lot of micro decisions, not too broad. When you are doing this, when you are running a slash PR Q command. Sometimes Ahmad prefers that, you know, you basically fetch the latest from main, rebase it, send the PR on a branch, and then go back to the main branch. That is what my PR workflow is. That is not what my PR file says, so it automatically goes and fixes it, right? And I-- it has seen me go back to the main branch again and again. It's like, okay, it's just one-liner for it, but improves my workflow without me having to take care of it. And over, you know, over the time, it basically compounds a lot. Like, if you have... One of the guys, I was looking at his code. He built one Android app with command code, and now he's like, "You know, I'm able to build more Android apps very, very quickly."

It's the same thing, but it is learning the skill from you as you build those things, right? And over time, we have seen that it becomes really helpful. Somewhere down there, we ran a study with seventy-plus developers on the number of times they had to go edit files because their LLM took a different turn, or had to steer their LLM like, "Don't do this. Don't use tRPC; use Hono for this part of the API." They found that the number of edits and steers went down. Right?
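Ahmad's "KL divergence loop" is only described verbally, so what follows is a hypothetical illustration of the idea, not CommandCode's implementation. It assumes you can estimate two distributions over the same candidate choices (say, CLI argument parsers): the model's prior with no taste file in context, and the developer's observed preference. A learning is kept only when KL(observed ‖ prior) exceeds a threshold, i.e. when writing it down would actually change what the model does. The function names and the 0.1-nat threshold are assumptions.

```python
import math


def kl_divergence(p: dict[str, float], q: dict[str, float], eps: float = 1e-12) -> float:
    """KL(p || q) in nats. p and q map choices to probabilities (each assumed to sum to 1).

    eps stands in for zero probabilities in q so that log(p / q) stays finite.
    """
    total = 0.0
    for choice, pk in p.items():
        if pk <= 0.0:
            continue  # terms with p = 0 contribute nothing to KL(p || q)
        qk = max(q.get(choice, 0.0), eps)
        total += pk * math.log(pk / qk)
    return total


def should_keep_learning(observed: dict[str, float],
                         model_prior: dict[str, float],
                         threshold: float = 0.1) -> bool:
    """Keep a learning only if the developer's behavior diverges from what the model does anyway."""
    return kl_divergence(observed, model_prior) > threshold


# Hypothetical numbers: the model leans toward Commander, but this developer always picks Meow.
prior = {"commander": 0.7, "yargs": 0.2, "meow": 0.1}
print(should_keep_learning({"meow": 1.0}, prior))  # True: worth recording
print(should_keep_learning(prior, prior))          # False: the model already does this
```

With a prior that favors Commander, an observed switch to Meow scores log(10) ≈ 2.3 nats and is kept, while behavior that matches the prior exactly scores zero and is dropped as useless context.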

Swyx31:47

So, so let me ask some clarifying questions.

Ahmad Awais31:49

Yeah. Yeah.

Swyx31:50

You, you, you... On the website it says model, but then you also say it's like portable, uh, between systems and stuff. Like, is it a-- Is, is there a system of files? Like, obviously this is a, a form of memory.

Ahmad Awais32:02

Mm-hmm.

Swyx32:02

How does it compare with skills? Do I-- Should I prefer skills? Can I use them side by side? There's all these questions.

Ahmad Awais32:08

Yeah, yeah. I've actually gotten this question so many times that I wrote a blog post on it: skills versus taste, right? What is the difference? At the most basic layer, taste is the highest-order bit, which manages your skills and rules. Taste is this automatic engine of sorts that creates skills for you and makes sure they're not stale, and you can obviously go edit them yourself as well. Overall, it actually looks like this, right? I think I have probably built more than seventy CLIs with CommandCode so far, and this is the entirety of my taste for building CLIs, this little thing, right? So it knows, for example, that I use pnpm's onlyBuiltDependencies setting. It knows that I always prefer starting versions from 0.0.1. It's those silly little things. And all I have to do is run npx taste pull. We have the taste package as well, by the way, right?

Swyx33:12

Nice.

Ahmad Awais33:14

Yep. It just pulls this particular file and puts it in your repository, and then all you have to do is ask any coding agent to follow my taste for building CLIs, build me a CLI that does this, this, and that, and by the end, show the taste compliance. Any LLM would just go through this whole list and be able to figure out, "Oh, I was supposed to use this. I was supposed to use Clack for interactive things," or whatnot. It works out really well. And when you continuously build on top of it, every project changes it, because every CLI is different from other CLIs, so this taste of sorts evolves in that particular repository, right? It lives in your Git repo anyway, right?

Swyx33:59

Yeah. Got it.

Ahmad Awais34:00

And all of this was automatically created. But you can, yeah, you can come in and edit it yourself as well. It's a markdown file.
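The taste file is plain markdown that any agent can read. Its exact layout isn't shown in the conversation, so this sketch assumes a hypothetical one: "##" headings for categories and "- " bullets for rules, each with an optional trailing "(confidence: 0.9)". It parses the file and renders the higher-confidence rules as a checklist you might append to a prompt when asking an agent to report taste compliance. parse_taste, TasteRule, the default of 1.0 for unscored rules, and the 0.7 cutoff are all illustrative assumptions, not the taste package's API.

```python
import re
from dataclasses import dataclass

# Hypothetical bullet format: "- rule text (confidence: 0.85)"; the score is optional.
RULE_LINE = re.compile(r"^-\s+(?P<text>.+?)(?:\s*\(confidence:\s*(?P<conf>[0-9.]+)\))?\s*$")


@dataclass
class TasteRule:
    section: str
    text: str
    confidence: float


def parse_taste(markdown: str) -> list[TasteRule]:
    """Collect bullet rules, tagging each with the nearest preceding '## ' heading."""
    rules: list[TasteRule] = []
    section = ""
    for raw in markdown.splitlines():
        line = raw.strip()
        if line.startswith("## "):
            section = line[3:].strip()
            continue
        match = RULE_LINE.match(line)
        if match:
            conf = match.group("conf")
            # Assumption: a rule with no score was written by a human and is fully trusted.
            rules.append(TasteRule(section, match.group("text"), float(conf) if conf else 1.0))
    return rules


def compliance_checklist(rules: list[TasteRule], min_confidence: float = 0.7) -> str:
    """Render rules at or above the cutoff as a checklist an agent can tick off."""
    return "\n".join(
        f"[ ] ({r.section}) {r.text}" for r in rules if r.confidence >= min_confidence
    )


taste_md = """# CLI taste
## Tooling
- Use pnpm (confidence: 0.9)
- Start versions at 0.0.1 (confidence: 0.5)
## Prompts
- Use Clack for interactive prompts
"""
print(compliance_checklist(parse_taste(taste_md)))
```

Keeping the file as markdown rather than a structured format means a human can still review and edit it in a PR, and the parser only has to tolerate that.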

Swyx34:07

Then the confidence, I guess, is LLM-generated as well, right? So if you-

Ahmad Awais34:11

Yeah. But like, but like if you are-

Swyx34:12

Always do it, it's one, you know. Yeah.

Ahmad Awais34:15

Something like this.

Swyx34:16

Yeah.

Ahmad Awais34:17

Like, the problem we actually hit was... The funny story there is that we actually tried to hide this early on. We figured that if we didn't show you the learnings and just had you compare Claude Code versus CommandCode, developers would have this wow moment, like, "Oh wow, this is so good," and they didn't know what was being learned. But the problem we hit is that we don't know how to merge different taste files. It is very much a human endeavor. Think of it like fifteen different engineers working on a GitHub repository with a thousand different branches. The taste files are so different across all those branches, and when you are merging back into main, we cannot judge what you want to keep and what you don't, right? So we made it really, really transparent. And this was before skills existed, by the way. This is from last October, right? There was no concept of skills. There were APIs, CLIs, and MCPs, and we ended up creating this, and then, you know, the skills thing came around. So if you as a human are writing something, I think you can probably just write a confidence score of one; you know what you are talking about, right? If something is being learned from you, some of it is really good and some of it is not. For example, I always have this local option, which I don't want to show up in my help commands, because it's an option for debugging the local setup of my APIs. A very little thing. I would never personally go and write this learning in a skill file; it's just too minor for me. But it's really good that it picks up on small things like this and works it out, right? One of the things I've seen in our Discord community, which is blowing up after the tool-calling fix, is that a lot of people are building one project with a really high-quality LLM, like Opus or GPT-5.5, right? They build a taste file, and then they use super cheap models to continuously build on that project with that taste file.
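The confidence discussion suggests two regimes: a rule a human writes down can be pinned at 1, while a learned rule should earn confidence from repeated evidence, like seeing Ahmad return to main after every PR. How CommandCode actually scores learnings isn't described, so as a stand-in this sketch uses Laplace (add-one) smoothing over how often the developer's behavior matched a candidate rule. LearnedRule, its fields, and the smoothing choice are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LearnedRule:
    """A candidate taste rule plus the evidence behind it (hypothetical structure)."""
    text: str
    human_written: bool = False
    agreed: int = 0     # relevant sessions where the developer's behavior matched the rule
    relevant: int = 0   # sessions where the rule applied at all

    def observe(self, matched: bool) -> None:
        self.relevant += 1
        if matched:
            self.agreed += 1

    @property
    def confidence(self) -> float:
        if self.human_written:
            return 1.0  # written down on purpose, so trusted fully
        # Laplace (add-one) smoothing: 0.5 with no evidence, tending toward agreed / relevant.
        return (self.agreed + 1) / (self.relevant + 2)


rule = LearnedRule("After opening a PR, switch back to main")
for _ in range(8):
    rule.observe(True)
rule.observe(False)
print(round(rule.confidence, 3))  # 0.818 (= 9 / 11)
```

Add-one smoothing keeps a rule observed only once or twice from jumping straight to full confidence, which matters if cheaper models are later handed only the higher-confidence rules.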

Swyx36:25

Yeah, I mean, I think that's a system that kind of works; it makes sense to me. Cool. I think it's a great overview. If people are interested, they can obviously look you up and chat more. You guys have been building in this space for a good while, and it's really cool to see that you're homing in on this coding use case. I have also obviously become more interested in coding AI agents for myself as well. Any final words? Any other things that people can look forward to?

Ahmad Awais36:57

I can probably share just a little bit about what our roadmap looks like, where we are headed next, right? We are going to open source CommandCode very, very soon. I'm hoping we can announce that at the AI Engineer conference in SF, if we can work out the quirks of a six-year-old repository. The idea behind that is-

Roadmap37:05

Swyx37:20

Six years old? Wait, hang on. Six years old? Oh, because it, you're-- it was a CLI before AI and stuff. Yeah.

Ahmad Awais37:25

Yeah. I've been building it since twenty twenty, right? It became a product last year, and I had told everybody and their brother that this would never be my product, right? Funny how things work out. Anyway, we are open sourcing it. The idea behind that is that I want to make CommandCode completely hackable. I've been working very hard on that. My background is also in this: I've spent thirteen years in WordPress Core. Matt Mullenweg is actually one of our angels now; when he heard that we are open sourcing CommandCode, he reached out. So the idea is that you should be able to modify any part of CommandCode, irrespective of where our business model is headed, right? The other idea I'm really subscribed to right now is that we will not turn this into a soup of fifteen hundred models where you decide what to do with them, right? I think of it like this: there are three different philosophies, right? One is like Windows, where every game works with it. I think OpenCode is like that; every model works with OpenCode. One is like Linux, where you build your own drivers, like Pi. You can build anything with Pi, right? With CommandCode, what I'm thinking, and what the team is moving towards, is that we're going to build it like Apple. It will have the best of the best models, both open and closed. It will not have every model, but it will be hackable in any way, so you would be able to put in your local model if you wanted to, right? So that's where we are headed. We're about to open source it very, very soon. So yeah, pretty excited about that.

Swyx39:11

Okay. Excellent. Looking forward to that. Open source is always good. It's all competition, I would say. I was reminded that DeepSeek actually announced they're going to do their own coding agent at some point. But, you know, obviously they'll only do DeepSeek stuff.

Ahmad Awais39:25

So many of the comments on this particular thing were people tagging DeepSeek folks: "Why are you not building your own coding agent?"

Swyx39:34

They are. They are. Yeah, they announced recently.

Ahmad Awais39:35

And then, yeah, maybe a week later or something, they announced that they are hiring for that or whatnot. So it makes sense, right? But it will only be for DeepSeek. Qwen 3.7 Max is the second-most-used model on CommandCode right now, and it's just two or three days old, right? So yeah.

Swyx39:55

Cool.

Ahmad Awais39:55

We'll see how it goes.

Swyx39:56

We'll see how it goes. Okay. Well, all the best. It's great to catch up and get a sense of everything going on. I definitely think there are some good ideas here that people can take regardless of what project they work on. That's why we do the podcast: to learn from everyone and try to build a common, shared knowledge of all these things.

Ahmad Awais40:18

Most definitely, and I'll continue sharing more and more. I think everybody can learn from this. You don't have to use this in CommandCode; you can use it in any coding harness you want, right? As long as we are all improving, right?

Swyx40:32

Yeah. Okay. That's it.

Ahmad Awais40:35

Awesome.
