That is already the case for me. The amount of times I've read "apologies for th...

autobodie · on June 26, 2025

In my experience, LLMs are extremely inclined to modify code just to pass tests instead of meeting requirements.

fwip · on June 26, 2025

When they're not modifying the tests to match buggy behavior. :P

devjab · on June 26, 2025

Are you using the LLMs through a browser chatbot? Because the AI-agents we use with direct code-access aren't very chatty. I'd also argue that they are more capable than a lot of junior programmers, at least around here. We're almost at a point where you can feed the agents short specific tasks, and they will perform them well enough to not really require anything outside of a code review.

That being said, the prediction engine still can't do any real engineering. If you don't specifically task them with using things like Python generators, you're very likely to end up with a piece of code that eats up a gazillion bytes of memory. Which unfortunately doesn't set them apart from a lot of Python programmers I know, but it is an example of how the LLMs are exactly as bad as you mention. On the positive side, it helps with people actually writing the specification tasks in more detail than just "add feature".
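To make the generator point concrete, here is a minimal sketch of the difference. The function names and the workload (squaring integers) are made up for illustration and are not from any real code base.

```python
import sys


def squares_list(n):
    """Eager version: builds every result up front, so memory grows with n."""
    return [i * i for i in range(n)]


def squares_gen(n):
    """Lazy version: yields one result at a time, so memory stays roughly flat."""
    for i in range(n):
        yield i * i


n = 100_000
eager = squares_list(n)
lazy = squares_gen(n)

# sys.getsizeof only measures the container object itself (not the ints
# it references), but that is already enough to show the gap.
list_bytes = sys.getsizeof(eager)
gen_bytes = sys.getsizeof(lazy)

# Consuming the generator gives the same answer without ever holding
# all n values in memory at once.
total = sum(lazy)
print(list_bytes, gen_bytes, total)
```

Both versions produce the same sum; the generator object stays small however large `n` gets, while the list grows with it.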

Where AI-agents are the most useful for us is with legacy code that nobody prioritises. We have a data extractor which was written in the previous millennium. It basically uses around two hundred hard-coded coordinates to extract data from a specific type of document which arrives by fax. It's worked for 30ish years because the documents haven't changed... but they recently did, and it took co-pilot like 30 seconds to correct the coordinates. Something that would've likely taken a human a full day of excruciating boredom.

I have no idea how our industry expects anyone to become an expert in the age of vibe coding though.

furyofantares · on June 26, 2025

> Because the AI-agents we use with direct code-access aren't very chatty.

Every time I tell claude code something it did is wrong, or might be wrong, or even just ask a leading question about a potential bug it just wrote, it leads with "You're absolutely correct!" before even invoking any tools.

Maybe you've just become used to ignoring this. I mostly ignore it but it is a bit annoying when I'm trying to use the agent to help me figure out if the code it wrote is correct, so I ask it some question it should be capable of helping with and it leads with "you're absolutely correct".

I didn't make a proposition that can be correct or not, and it hadn't done any work yet to investigate my question - it feels like it has poisoned its own context by leading with this.

devjab · 2025-06-27T07:23:32

It may have to do with workflow. I rarely talk with the AI agent, I task it with a VIBE.md or a specific outlined prompt that relates to inline "COPILOT: ...." comments, and then I review the changes and either keep or dismiss them. When I dismiss them I'll mostly rewrite the prompt and do it again in a new context window.

I did get curious though. So I decided to look up some of the times where I did correct it after I dismissed a change. I only looked at a couple of prompts but most of the AI responses looked like this:

"There are two issues...", "The error is because...", "The error persists because...", "A new route, /class_ids/fully_owned, has been added...".

I was feeling confident that it wasn't bullshitting me at that point, but then I got to this one:

"Thank you for the details. The error response..."

Now, that is the AI agent. If I use the browser or one of their "apps" the LLM politeness encouragement bullshit alone will often be longer than the entire chat response in an agent. Like this is the entire response to what it was tasked with in my example:

"To add this route, I'll implement a new endpoint that queries all unique class_id values and checks if all items with that class_id have is_owned == True. The route will return a list of such class_id values.

I'll add this as a new GET route, e.g., /class_ids/fully_owned."

ChatGPT (which is supposedly the same model) will spend those lines telling me what a great question it was.
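For reference, the logic the agent describes in the quoted response could be sketched roughly like this. It is plain Python over a list of dicts rather than a real route handler; only `class_id` and `is_owned` come from the quoted text, while the function name and sample data are hypothetical.

```python
def fully_owned_class_ids(items):
    """Return the class_id values for which every item has is_owned == True."""
    all_owned = {}
    for item in items:
        cid = item["class_id"]
        # A class_id stays "fully owned" only while every item seen so far is owned.
        all_owned[cid] = all_owned.get(cid, True) and bool(item["is_owned"])
    return sorted(cid for cid, owned in all_owned.items() if owned)


# Hypothetical sample rows standing in for whatever the real data store returns.
items = [
    {"class_id": 1, "is_owned": True},
    {"class_id": 1, "is_owned": True},
    {"class_id": 2, "is_owned": True},
    {"class_id": 2, "is_owned": False},
    {"class_id": 3, "is_owned": True},
]

result = fully_owned_class_ids(items)
print(result)  # [1, 3]
```

A real GET /class_ids/fully_owned route would wrap something like this, or push the same check down into a query.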

steveklabnik · 2025-06-27T17:48:55

I put some stuff in Claude.md to tell it to chill out, and that helps. If you want it to communicate with you in a given style, tell it to do that.

gibspaulding · on June 26, 2025

> Where AI-agents are the most useful for us is with legacy code

I’d love to hear more about your workflow and the code base you’re working in. I have access to Amazon Q (which it looks like is using Claude Sonnet 4 behind the scenes) through work, and while I found it very useful for greenfield projects, I’ve really struggled using it to work on our older code bases. These are all single-file 20,000 to 100,000 line C modules with lots of global variables and most of the logic plus 25 years of changes dumped into a few long functions. It’s hard for a human to navigate, and it seems to completely overwhelm Q’s context window.

Do other Agents handle this sort of scenario better, or are there tricks to making things more manageable? Obviously re-factoring to break everything up into smaller files and smaller functions would be great, but that’s just the sort of project that I want to be able to use the AI for.

devjab · 2025-06-27T07:06:24

We use co-pilot through our azure license in VSC. My personal workflow is that I'll write a VIBE.md with very specific information on what I want and what I expect. Then in the actual code file I'll add a comment like "COPILOT: this is where I want you to do X". I'll then grant the agent access to the necessary files for the context. With big files it gets trickier because the prediction engine fails to distinguish between relevant and irrelevant context. I have the most success with incremental changes where the agent has to do one task at a time, and you can outline that in the VIBE.md + the comments where you add "COPILOT: This is step X...". In my coordinate example it actually had to change quite a lot of things, but that is still what I consider one task.

Context size matters a lot in my experience, but I'm not sure if it matters whether your 100k lines are in a single file or multiple files. I tend to cut down what I feed the agent to the actual context, so if I have a 100k line file, but only 3000 lines matter, then I'll only feed those 3000 lines to the AI. Even in a couple of small files with maybe 200 lines of code in total, I'll only give the AI access to the 40 lines which are the context it needs to work on.
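As a rough illustration of cutting a big file down to the lines that matter, here is a small sketch that keeps only a window around each "COPILOT:" comment. This is not how co-pilot itself selects context; the helper name, the `radius` parameter and the sample file are assumptions made up for the example.

```python
def context_windows(source, marker="COPILOT:", radius=2):
    """Return (line_number, text) pairs within `radius` lines of each marker.

    Hypothetical helper: the marker text mirrors the inline comments described
    above, and the window size is an arbitrary choice.
    """
    lines = source.splitlines()
    keep = set()
    for idx, line in enumerate(lines):
        if marker in line:
            start = max(0, idx - radius)
            end = min(len(lines), idx + radius + 1)
            keep.update(range(start, end))
    # Line numbers are 1-based so they match what an editor shows.
    return [(i + 1, lines[i]) for i in sorted(keep)]


# A made-up ten line file with one marker comment on its sixth line.
sample = "\n".join(
    f"x{i} = {i}" if i != 5 else "# COPILOT: this is where I want you to do X"
    for i in range(10)
)

snippet = context_windows(sample)
for number, text in snippet:
    print(number, text)
```

With the default radius this keeps lines 4 to 8 of the sample, which is the kind of trimmed slice you would paste or attach instead of the whole file.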

English isn't my first language, so when I say context, what I mean is everything which is related to the change I want the agent to do. I will use SQLC as an example. Even though I feed the AI the Go model generated, I'll also give it access to the raw SQL file.

> Obviously re-factoring to break everything up into smaller files and smaller functions would be great, but that’s just the sort of project that I want to be able to use the AI for.

I'm guessing here, but I think part of our success is also our YAGNI approach. AI seems to have an easier time with something like Go where everything is explicit, everything is functions and Go modules live in isolation. Similarly, AI will do much better with Python that is built with dataclasses and functions, and struggle with Python that is built upon more traditional OOP hierarchies. We've also had very little success with agents on C#. I have no idea whether that is because of C#'s inherent implicitness and "black magic" or because of the .net > .net core > .net framework > .net + whatever I forgot journey confusing the prediction engine.
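To show what "dataclasses and functions" means here, a tiny sketch of that style. The `Invoice` type and `gross` function are invented for the example and have nothing to do with our actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Invoice:
    """Plain data, no behaviour and no base class to look up."""
    customer: str
    net: float
    vat_rate: float


def gross(invoice: Invoice) -> float:
    """Everything the function needs is in its signature."""
    return invoice.net * (1 + invoice.vat_rate)


invoice = Invoice(customer="ACME", net=100.0, vat_rate=0.25)
total = gross(invoice)
print(total)  # 125.0
```

The data and the function can each be read without knowing about anything else, which is my guess for why the agent needs less surrounding context than it would with a deep class hierarchy.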

> Do other Agents handle this sort of scenario better

I don't know. I've only used the sanctioned co-pilot agent professionally. I believe that is a GPT-4 model, but I'm not exactly sure on the details. For personal projects I use both the free version of GPT-4 in co-pilot and Claude Sonnet 4, and I haven't noticed much of a difference, but I have no hobby projects which are comparable.

teeray · on June 26, 2025

> Because the AI-agents we use with direct code-access aren't very chatty.

So they’re even more confident in their wrongness

devjab · 2025-06-27T07:08:11

When there is less chat they appear less confident, but I think you're pretty spot on in pointing out why they are dangerous. If you're using them in an area where you're an expert, it's very easy to review the "diff" suggestion they come up with and decide whether it's bullshit or not. If you're using them in an area where you're not an expert, then how will you know?

axegon_ · 2025-06-27T11:18:48

Of course I don't have any extensions locally, I am not a lunatic. I don't always have access to my personal hardware and I would never trust an extension to pass my code around over http to a server I don't have full control of. DDoSecrets should have been enough of a warning for most people but I suspect countless more will have to learn that lesson the hard way.

mexicocitinluez · on June 26, 2025

> 8 or 9 out of 10 times.

No they don't. This is 100% a made up statistic.

bluefirebrand · on June 26, 2025

It isn't even being presented as a statistic; it is someone saying what they have experienced

mexicocitinluez · 2025-06-27T14:03:08

not to respond again but whats even funnier is that IT IS a statistic. saying "90% of the time I get a certain response" is literally a statistic.

mexicocitinluez · 2025-06-27T12:44:11

nice. i'll add this to the list of "totally useless replies".