Hacker News on Gopher (unofficial)
URI Visit Hacker News on the Web
COMMENT PAGE FOR: URI Two things LLM coding agents are still bad at
mohsen1 wrote 35 min ago: > LLMs are terrible at asking questions Not if they're instructed to. In my experience you can adjust the prompt to make them ask questions. They ask very good questions, actually!
Plough_Jogger wrote 53 min ago: Let's just change the title to "LLM coding agents don't use copy & paste or ask clarifying questions" and save everyone the click.
odkral wrote 56 min ago: If I need exact copy-pasting, I indicate that a couple of times in the prompt and it (Claude) actually does what I am asking. But yeah, overall very bad at refactoring big chunks.
SamDc73 wrote 58 min ago: For 2) I feel like codex-5 kind of attempted to address this problem; with Codex it usually asks a lot of questions and gives options before digging in (without me prompting it to). For copy-paste, you made it feel like low-hanging fruit? Why don't AI agents have copy/paste tools?
joshribakoff wrote 1 hour 11 min ago: My human fixed a bug by introducing a new one. Classic. Meanwhile, I write the lint rules, build the analyzers, and fix 500 errors before they've finished reading Stack Overflow. Just don't ask me to reason about their legacy code -- I'm synthetic, not insane.
Just because this new contributor is forced to effectively "SSH" into your codebase and edit not even with vim but with sed and awk does not mean that this contributor is incapable of using other tools if empowered to do so. The fact that it is able to work within such constraints goes to show how much potential there is. It is already much better than a human would be at erasing the text and re-typing it from memory, and while it is a valid criticism that it needs to be taught how to move files, imagine what it is capable of once it starts to use tools effectively.
Recently, I observed an LLM flail around for hours trying to get our e2e tests running as it tried to coordinate three different processes in three different terminals. It kept running commands in one terminal trying to kill or check whether the port was being used in the other terminal. However, once I prompted the LLM to create a script for running all three processes concurrently, it was able to create that script, leverage it, and autonomously debug the tests far faster than I am able to. It has also saved any new human who tries to contribute from similar hours of flailing around. It is something we could have easily done by hand but just never had the time to do before LLMs. If anything, the LLM is just highlighting an existing problem in our codebase that some of us got too used to. So yes, LLMs make stupid mistakes, but so do humans; the thing is that LLMs can identify and fix them faster (and better, with proper steering).
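The concurrent-runner script described in the comment above really can be tiny; here is a rough sketch in Python, where the three commands are hypothetical placeholders rather than anything from the actual project:

    # Sketch of a "run everything the e2e tests need" helper.
    # The commands below are hypothetical placeholders -- substitute your real ones.
    import subprocess
    import sys

    COMMANDS = [
        ["npm", "run", "api"],   # hypothetical backend process
        ["npm", "run", "web"],   # hypothetical frontend process
        ["npm", "run", "e2e"],   # hypothetical test runner
    ]

    def main() -> int:
        procs = [subprocess.Popen(cmd) for cmd in COMMANDS]
        try:
            # Wait for the test runner (the last command), then tear the rest down.
            return procs[-1].wait()
        finally:
            for p in procs[:-1]:
                p.terminate()

    if __name__ == "__main__":
        sys.exit(main())

Once something like this lives in the repo, both the agent and any new human contributor get the same shortcut.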
strangescript wrote 1 hour 18 min ago: You don't want your agents to ask questions. You are thinking too short term. It's not ideal now, but agents that have to ask frequent questions are useless when it comes to the vision of totally autonomous coding. Humans ask questions of groups to fix our own personal shortcomings. It makes no sense to try and master an internal system I rarely use; I should instead ask someone that maintains it. AI will not have this problem provided we create paths of observability for them. It doesn't take a lot of "effort" for them to completely digest an alien system they need to use.
justonceokay wrote 1 hour 13 min ago: If you look at a piece of architecture, you might be able to infer the intentions of the architect. However, there are many interpretations possible. So if you were to add an addendum to the building, it makes sense that you might want to ask about the intentions. I do not believe that AI will magically overcome the Chesterton's Fence problem in a 100% autonomous way.
causal wrote 1 hour 26 min ago: Similar to the copy/paste issue, I've noticed LLMs are pretty bad at distilling large documents into smaller documents without leaving out a ton of detail. Like maybe you have a super redundant doc. Give it to an LLM and it won't just deduplicate it, it will water the whole thing down.
justinhj wrote 1 hour 26 min ago: Building an MCP tool that has access to refactoring operations should be straightforward, and using it appropriately is well within the capabilities of current models. I wonder if it exists? I don't do a lot of refactoring with LLMs so I haven't really had this pain point.
majora2007 wrote 1 hour 36 min ago: I think LLMs provide value; I used one this morning to fix a bug in my PDF Metadata parser without having to get too deep into the PDF spec. But most of the time, I find that the outputs are nowhere near the result of just doing it myself. I tried Codex Code the other day to write some unit tests. I had a few set up and wanted to use it (because mocking the data is a pain). It took about 8 attempts, I had to manually fix code, and it couldn't understand that some entities were obsolete (despite being marked and the original service not using them). Overall, I was extremely disappointed. I still don't think LLMs are capable of replacing developers, but they are great at exposing knowledge in fields you might not know and helping guide you to a solution, like Stack Overflow used to do (without the snark).
ojosilva wrote 27 min ago: I think LLMs have what it takes at this point in time, but it's the coding agent (combined with the model) that makes the magic happen. Coding agents can implement copy-pasting; it's a matter of building the right tool for it, then iterating with given models/providers, etc. And that's true for everything else that LLMs lack today. Shortcomings can be remediated with good memory and context engineering, safety-oriented instructions, endless verification and good overall coding agent architecture. Also, having a model that can respond fast, have a large context window and maintain attention to instructions is essential for a good overall experience. And the human prompting, of course. It takes good software engineering skills, particularly knowing how to instruct other devs in getting the work done, setting up good AGENTS.md (CLAUDE.md, etc.) with codebase instructions, best practices, etc. So it's not "AI/LLMs are capable of replacing developers"... that's getting old fast. It's more like, paraphrasing the wise: "it's not what your LLM can do for you, but what can you do for your LLM".
simonw wrote 1 hour 37 min ago: I feel like the copy and paste thing is overdue a solution. I find this one particularly frustrating when working directly with ChatGPT and Claude via their chat interfaces. I frequently find myself watching them retype 100+ lines of code that I pasted in just to make a one line change. I expect there are reasons this is difficult, but difficult problems usually end up solved in the end.
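The tooling half of this seems tractable. As a rough sketch of what a verbatim line-range copy tool could look like if an agent framework chose to expose one (the name and signature below are invented for illustration, not any existing agent's API):

    import pathlib

    def copy_lines(src: str, start: int, end: int, dest: str, at: int) -> None:
        """Copy lines start..end (1-based, inclusive) of src into dest before line `at`,
        preserving the text byte-for-byte instead of having the model retype it."""
        src_lines = pathlib.Path(src).read_text().splitlines(keepends=True)
        dest_path = pathlib.Path(dest)
        dest_lines = dest_path.read_text().splitlines(keepends=True)
        block = src_lines[start - 1:end]       # the copied block is never regenerated
        dest_lines[at - 1:at - 1] = block      # splice it in at the target position
        dest_path.write_text("".join(dest_lines))

The harder part is presumably the training signal for when the model should reach for a tool like this instead of retyping, not the tool itself.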
danenania wrote 47 min ago: Yeah, I've always wondered if the models could be trained to output special reference tokens that just copy verbatim slices from the input, perhaps based on unique prefix/suffix pairs. Would be a dramatic improvement for all kinds of tasks (coding especially).
rhetocj23 wrote 1 hour 25 min ago: What's the time horizon for said problems to be solved? Because guess what -- time is running out and people will not continue to aimlessly throw money at this stuff.
simonw wrote 42 min ago: I don't see this one as an existential crisis for AI tooling, more of a persistent irritation. AI labs have already shipped changes related to this problem -- most notably speculative decoding, which lets you provide the text you expect to see come out again and speeds it up: [1] They've also been iterating on better tools for editing code a lot as part of the competition between Claude Code and Codex CLI and other coding agents. Hopefully they'll figure out a copy/paste mechanism as part of that work. URI [1]: https://simonwillison.net/2024/Nov/4/predicted-outputs/
TrackerFF wrote 1 hour 42 min ago: I very much agree on point 2. I often wish that instead of just starting to work on the code automatically, even if you hit enter / send by accident, the models would rather ask for clarification. The models assume a lot, and will just spit out code first. I guess this is somewhat to lower the threshold for non-programmers, and to instantly give some answer, but it does waste a lot of resources - I think. Others have mentioned that you can fix all this by providing a guide to the model, how it should interact with you, and what the answers should look like. But, still, it'd be nice to have it a bit more human-like on this aspect.
nc wrote 1 hour 42 min ago: Add to this list: the ability to verify a correct implementation by viewing a user interface, and taking a holistic code-base / interface-wide view of how to best implement something.
capestart wrote 1 hour 44 min ago: Large language models can help a lot, yet they still lack the human touch, particularly in the areas of context comprehension and question formulation. The entire "no copy-paste" rule seems strange as well. It is as if the models were performing an operation solely in their minds rather than just repeating it like we do. It gives the impression that they are learning by making mistakes rather than thinking things through. They are certainly not developers' replacements at this point!
jamesjyu wrote 1 hour 56 min ago: For #2, if you're working on a big feature, start with a markdown planning file that you and the LLM work on until you are satisfied with the approach. Doesn't need to be rocket science: even if it's just a couple paragraphs, it's much better than doing it one shot.
linsomniac wrote 1 hour 59 min ago: >Sure, you can overengineer your prompt to try get them to ask more questions That's not overengineering, that's engineering. "Ask clarifying questions before you start working", in my experience, has led to some fantastic questions, and is a useful tool even if you were to not have the AI tooling write any code. As a good programmer, you should know when you are handing the tool a complete spec to build the code and when the spec likely needs some clarification, so you can guide the tool to ask when necessary.
manmal wrote 47 min ago: You can even tell it how many questions to ask. For complex topics, I might ask it to ask me 20 or 30 questions. And I'm always surprised how good those are.
You can also keep those around as a QnA file for later sessions or other agents.
segmondy wrote 1 hour 59 min ago: Someone has definitely fallen behind and has massive skill issues. Instead of learning, you are wasting time writing bad takes on LLMs. I hope most of you don't fall down this hole; you will be left behind.
mehdibl wrote 2 hours 7 min ago: You can do copy and paste if you offer it a tool/MCP that does that. It's not complicated, using either function extraction with the AST as the target, or line numbers. Also, if you want it to pause and ask questions, you need to offer that through tools (Manus does that, for example). I have an MCP that does that, and surprisingly I got a lot of questions; if you prompt for it, it will do it. But the push currently is for full automation, and that's why it's not there. We are far better in supervised, step-by-step mode. There is elicitation already in MCP, but having a tool ask questions requires you to have a UI that will allow the input to be set back.
ravila4 wrote 2 hours 11 min ago: Regarding copy-paste, I've been thinking the LLM could control a headless Neovim instance instead. It might take some specialized reinforcement learning to get a model that actually uses Vim correctly, but then it could issue precise commands for moving, replacing, or deleting text, instead of rewriting everything. Even something as simple as renaming a variable is often safer and easier when done through the editor's language server integration.
enraged_camel wrote 2 hours 27 min ago: The first point is very annoying, yes, and it's why for large refactors I have the AI write step-by-step instructions and then do it myself. It's faster, cheaper and less error-prone. The second point is easily handled with proper instructions. My AI agents always ask questions about points I haven't clarified, or when they come across a fork in the road. Frequently I'll say "do X" and it'll proceed, then halfway it will stop and say "I did some of this, but before I do the rest, you need to decide what to do about such and such". So it's a complete non-problem for me.
pengfeituan wrote 2 hours 29 min ago: The first issue is related to the inner behavior of LLMs. Humans can ignore some detailed contents of code and copy and paste, but LLMs convert them into hidden states. It is a process of compression, and the output is a process of decompression, so something may be lost. That is why it is hard for LLMs to copy and paste. The agent developer should customize the edit rules to do this. The second issue is that LLMs do not learn much of the high-level contextual relationships of knowledge. This can be improved by introducing more patterns in the training data, and current LLM training is doing much on this. I don't think it will be a problem in the next few years.
8s2ngy wrote 2 hours 39 min ago: One thing LLMs are surprisingly bad at is producing correct LaTeX diagram code. Very often I've tried to describe in detail an electric circuit, a graph (the data structure), or an automaton so I can quickly visualize something I'm studying, but they fail. They mix up labels, draw without any sense of direction or ordering, and make other errors. I find this surprising because LaTeX/TikZ have been around for decades and there are plenty of examples they could have learned from.
celeritascelery wrote 2 hours 42 min ago: The "LLMs are bad at asking questions" point is interesting. There are some times that I will ask the LLM to do something without giving it all the needed information.
And rather than telling me that something's missing or that it can't do it the way I asked, it will try and do a halfway job using fake data or mock something out to accomplish it. What I really wish it would do is just stop and say, "Hey, I can't do it like you asked. Did you mean this?"
squirrel wrote 2 hours 53 min ago: A friendly reminder that "refactor" means "make and commit a tiny change in less than a few minutes" (see links below). The OP and many comments here use "refactor" when they actually mean "rewrite". I hear from my clients (but have not verified myself!) that LLMs perform much better with a series of tiny, atomic changes like Replace Magic Literal, Pull Up Field, and Combine Functions Into Transform. [1] [2] URI [1]: https://martinfowler.com/books/refactoring.html URI [2]: https://martinfowler.com/bliki/OpportunisticRefactoring.html URI [3]: https://refactoring.com/catalog/
Lerc wrote 2 hours 54 min ago: I think the issue with them making assumptions and failing to properly diagnose issues comes more from fine-tuning than any particular limitation in LLMs themselves. When fine-tuned on a set of problem->solution data, it kind of carries the assumption that the problem contains enough data for the solution. What is really needed is a tree of problems which appear identical at first glance, but where the issue and the solution are one of many possibilities which can only be revealed by finding what information is lacking, acquiring that information, testing the hypothesis and then, if the hypothesis is shown to be correct, finally implementing the solution. That's a much more difficult training set to construct. The editing issue, I feel, needs something more radical. Instead of the current methods of text manipulation, I think there is scope to have a kind of output position encoding for a model to emit data in a non-sequential order. Again this presents another training data problem; there are limited natural sources to work from showing programming in the order a programmer types it. On the other hand, I think it should be possible to create synthetic training examples by taking existing model outputs that emit patches, search/replaces, regex mods etc. and translating those to a format that directly encodes the final position of the desired text. At some stage I'd like to see if it's possible to construct the model's current idea of what the code is purely by scanning a list of cached head_embeddings of any tokens that turned into code. I feel like there should be enough information, given the order of emission and the embeddings themselves, to reconstruct a piecemeal generated program.
imcritic wrote 3 hours 9 min ago: About the first point mentioned in the article: could that problem be solved simply by changing the task from something like "refactor this code" to something like "refactor this code as a series of smaller atomic changes (like moving blocks of code or renaming variable references in all places), each suitable for a git commit (and provide git message texts for those commits)"?
crazygringo wrote 3 hours 55 min ago: > Sure, you can overengineer your prompt to try get them to ask more questions (Roo for example, does a decent job at this) -- but it's very likely still won't. Not in my experience. And it's not "overengineering" your prompt, it's just writing your prompt.
For anything serious, I always end every relevant request with an instruction to repeat back to me the full design of my instructions, or to ask me necessary clarifying questions first if I've left anything unclear, before writing any code. It always does. And I don't mind having to write that, because sometimes I don't want that. I just want to ask it for a quick script and assume it can fill in the gaps because that's faster.
notpachet wrote 3 hours 57 min ago: > They keep trying to make it work until they hit a wall -- and then they just keep banging their head against it. This is because LLMs trend towards the centre of the human cognitive bell curve in most things, and a LOT of humans use this same problem solving approach.
gessha wrote 3 hours 43 min ago: The approach doesn't matter as much. The halting problem does :)
NumberCruncher wrote 4 hours 4 min ago: I don't really understand why there's so much hate for LLMs here, especially when it comes to using them for coding. In my experience, the people who regularly complain about these tools often seem more interested in proving how clever they are than actually solving real problems. They also tend to choose obscure programming languages where it's nearly impossible to hire developers, or they spend hours arguing over how to save $20 a month. Over time, they usually get what they want: they become the smartest ones left in the room, because all the good people have already moved on. What's left behind is a codebase no one wants to work on, and you can't hire for it either. But maybe I've just worked with the wrong teams. EDIT: Maybe this is just about trust. If you can't bring yourself to trust code written by other human beings, whether it's a package, a library, or even your own teammates, then of course you're not going to trust code from an LLM. But that's not really about quality, it's about control. And the irony is that people who insist on controlling every last detail usually end up with fragile systems nobody else wants to touch, and teams nobody else wants to join.
kakacik wrote 3 hours 49 min ago: It has been discussed ad nauseam. It demolishes the learning curve all of us with decade(s) of experience went through to become the seniors we are. It's not a function of age, not a function of time spent staring at some screen or churning out basic CRUD apps; it's a function of hard experience, frustration, hard-won battles, grokking underlying technologies or algorithms. LLMs provide little of that; they make people lazy, juniors stay juniors forever, even degrading mentally in some aspects. People need struggle to grow; when you have somebody who has had their hand held their whole life, they are a useless human disconnected from reality, unable to self-sufficiently achieve anything significant. Too easy a life destroys both humans and animals alike (many experiments have been done on that, with damning results). There is much more, like hallucinations and the questionable added value of stuff that confidently looks OK but has underlying hard-to-debug bugs, but the above should be enough for a start. I suggest actually reading those conversations, not just skimming through them; this has been stated countless times.
tossandthrow wrote 3 hours 55 min ago: I regularly check in on using LLMs. But a key criterion for me is that an LLM needs to objectively make me more efficient, not subjectively. Often I find myself cursing at the LLM for not understanding what I mean -- which is expensive in lost time / cost of tokens. It is easy to say: Then just don't use LLMs.
But in reality, it is not that easy to break out of these loops of explaining, and it is extremely hard to assess when to trust that the LLM will be able to finish the task. I also find that LLMs consistently don't follow guidelines. E.g. never use coercions in TypeScript (it always gets in a rogue `as` somewhere), so I cannot trust the output and need to be extra vigilant when reviewing. I use LLMs for what they are good at: sketching up a page in React/Tailwind, sketching up a small test suite -- everything that can be deemed a translation task. I don't use LLMs for tasks that are reasoning heavy: data modelling, architecture, large complex refactors -- things that require deep domain knowledge and reasoning.
NumberCruncher wrote 3 hours 47 min ago: > Often I find myself cursing at the LLM for not understanding what I mean... Me too. But in all these cases, sooner or later, I realized I made a mistake by not giving enough context and not building up the discussion carefully enough. I was just rushing to the solution. In the agile world, one could say I gave the LLM not a well-defined story, but a one-liner. Who is to blame here? I still remember training a junior hire who started off with: "Sorry, I spent five days on this ticket. I thought it would only take two. Also, who's going to do the QA?" After 6 months or so, the same person was saying: "I finished the project in three weeks. I estimated four. QA is done. Ready to go live." At that point, he was confident enough to own his work end-to-end, even shipping to production without someone else reviewing it. Interestingly, this colleague left two years ago, and I had to take over his codebase. It's still running fine today, and I've spent maybe a single day maintaining it in the last two years. Recently, I was talking with my manager about this. We agreed that building confidence and self-checking in a junior dev is very similar to how you need to work with LLMs. Personally, whenever I generate code with an LLM, I check every line before committing. I still don't trust it as much as the people I trained.
mcny wrote 4 hours 55 min ago: I sometimes give LLMs random "easy" questions. My assessment is still that they all need the fine print "bla bla can be incorrect". You should either already know the answer or have a way to verify the answer. If neither, the matter must be inconsequential, like just childlike curiosity. For example, I wonder how many moons Jupiter has... It could be 58, it could be 85, but either answer won't alter any of what I do today. I suspect some people (who need to read the full report) dump thousand-page-long reports into an LLM, read the first ten words of the response and pretend they know what the report says, and that is scary.
mexicocitinluez wrote 3 hours 25 min ago: > or have a way to verify the answer Fortunately, as devs, this is our main loop. Write code, test, debug. And it's why people who fear AI-generated code making its way into production and causing errors make me laugh. Are you not testing your code? Or even debugging it? Like, what process are you using that prevents bugs from happening? Guess what? It's the exact same process with AI-generated code.
latexr wrote 3 hours 32 min ago: > For example, I wonder how many moons Jupiter has... It could be 58, it could be 85 For those curious, the answer is 97. URI [1]: https://en.wikipedia.org/wiki/Moons_of_Jupiter
mr_mitm wrote 4 hours 59 min ago: The other day, I needed Claude Code to write some code for me.
It involved messing with the TPM of a virtual machine. For that, it was supposed to create a directory called `tpm_dir`. It constantly got it wrong and wrote `tmp_dir` instead, and tried to fix its mistake over and over again, leading to lots of weird loops. It completely went off the rails; it was bizarre.
hotpotat wrote 5 hours 3 min ago: Lol, this person talks about easing into LLMs again two weeks after quitting cold turkey. The addiction is real. I laugh because I'm in the same situation, and see no way out other than to switch professions and/or take up programming as a hobby in which I purposefully subject myself to hard mode. I'm too productive with it in my profession to scale back and do things by hand -- the cat is out of the bag, and I've set a race pace at work that I can't reasonably retract from without raising eyebrows. So I agree with the author's referenced post that finding ways to still utilize it while maintaining a mental map of the code base and limiting its blast radius is a good middle ground, but damn, it requires a lot of discipline.
mallowdram wrote 3 hours 41 min ago: The cat out of the bag is disautomation. The speed in the timetable is an illusion if the supervision requires blast-radius retention. This is more like an early video game assembly line than a structured, skilled industry.
schwartzworld wrote 5 hours 1 min ago: > I've set a race pace at work that I can't reasonably retract from without raising eyebrows Why do this to yourself? Do you get paid more if you work faster?
hotpotat wrote 4 hours 38 min ago: It started as a mix of self-imposed pressure and actually enjoying marking tasks as complete. Now I feel resistant to relaxing things. And no, I definitely don't get paid more.
cadamsdotcom wrote 5 hours 36 min ago: You need good checks and balances. E2E tests for your happy path, TDD when you & your agent write code. Then you - and your agent - can refactor fearlessly.
mihau wrote 6 hours 46 min ago: > you can overengineer your prompt to try get them to ask more questions Why overengineer? It's super simple. I just do this for 60% of my prompts: "{long description of the feature}, please ask 10 questions before writing any code"
amelius wrote 7 hours 3 min ago: I recently asked an LLM to fix an Ethernet connection while I was logged into the machine through another connection. Of course, I explicitly told the LLM not to break that connection. But, as you can guess, in the process it did break the connection. If an LLM can't do sysadmin stuff reliably, why do we think it can write quality code?
arbirk wrote 7 hours 54 min ago: Those 2 things are not inherent to LLMs and could easily be changed by giving them the proper tools and instructions.
podgorniy wrote 7 hours 57 min ago: > LLMs are terrible at asking questions. They just make a bunch of assumptions _Did you ask it to ask questions?_
_ink_ wrote 7 hours 59 min ago: > LLMs are terrible at asking questions. They just make a bunch of assumptions and brute-force something based on those guesses. I don't agree with that. When I am telling Claude Code to plan something, I also mention that it should ask questions when information is missing. The questions it comes up with are really good, sometimes about cases I simply didn't see. To me the planning discussion doesn't feel much different than in a GitLab thread, only at a much higher iteration speed.
BenGosub wrote 8 hours 0 min ago: The issue is partly that some expect a fully fledged app or a full problem solution, while others want incremental changes.
To some extent this can be controlled by setting the rules at the beginning of the conversation. Only to some extent, because the limitations noted in the blog still apply.
janmarsal wrote 8 hours 23 min ago: My biggest issue with LLMs right now is that they're such spineless yes men. Even when you ask their opinion on whether something is doable or should be done in the first place, more often than not they just go "Absolutely!" and shit out a broken answer or an anti-pattern just to please you. Not always, but way too often. You need to frame your questions way too carefully to prevent this. Maybe some of those character.ai models are sassy enough to have stronger opinions on code?
cheema33 wrote 8 hours 30 min ago: From the article: > I contest the idea that LLMs are replacing human devs... AI is not able to replace good devs. I am assuming that nobody sane is claiming such a thing today. But it can probably replace bad and mediocre devs. Even today. In my org we had 3 devs who went through a 6-month code boot camp and got hired a few years ago when it was very difficult to find good devs. They struggled. I would give them easy tasks and then clean up their PRs during review. And then AI tools got much better and started outperforming these guys. We had to let two go, and the third one quit on his own. We still hire devs, but we have become very reluctant to hire junior devs. And we will never hire someone from a code boot camp. And we are not the only ones. I think most boot camps have gone out of business for this reason. Will AI tools eventually get good enough to start replacing good devs? I don't know. But the data so far shows that these tools keep getting better over time. Anybody who argues otherwise has their head firmly stuck in the sand. In early US history, approximately 90% of the population was involved in farming. Over the years things changed. Now about 2% has anything to do with farming. Fewer people are farming now, but we have a lot more food and a larger variety available. Technology made that possible. It is totally possible that something like that could happen to the software development industry as well. How fast it happens totally depends on how fast the tools improve.
Leynos wrote 7 hours 30 min ago: What do you think was the reason that the bootcamp grads struggled to get better at what they do?
_1 wrote 1 hour 35 min ago: My experience with them is that they are taught to cover as much syntax and as many libraries as possible, without spending time learning how to solve problems and develop their own algorithms. They (in general) expect to follow predefined recipes.
cheema33 wrote 2 hours 57 min ago: A computer science degree at most US colleges takes about 4 years of work. Boot camps try to cram that into 6 months, all the while many students have other full-time jobs. This is simply not enough training for the students to start solving complex real-world problems. Even 4 years is not enough. Many companies were willing to hire fresh college grads in the hopes that they could solve relatively easy problems for a few years, gain experience and become successful senior devs at some point. However, with the advent of AI dev tools, we are seeing very clear signs that junior dev hiring rates have fallen off a cliff. Our project manager, who has no dev experience, frequently assigns easy tasks/GitHub issues to GitHub Copilot. Copilot generates a PR in a few minutes that other devs can review before merging.
These PRs are far superior to what an average graduate of a code boot camp could ever create. Any need we had for a junior dev has completely disappeared.
username223 wrote 58 min ago: > Any need we had for a junior dev has completely disappeared. Where do your senior devs come from?
weakfish wrote 22 min ago: That's the question that has been stuck in my head as I read all these stories about junior dev jobs disappearing. I'm firmly mid-level, having started my career just before LLM coding took off. Sometimes it feels like I got on the last chopper out of Saigon.
aragonite wrote 8 hours 31 min ago: Has anyone had success getting a coding agent to use an IDE's built-in refactoring tools via MCP, especially for things like project-wide rename? Last time I looked into this, the agents I tried just did regex find/replace across the repo, which feels both error-prone and wasteful of tokens. I haven't revisited recently, so I'm curious what's possible now.
olejorgenb wrote 5 hours 21 min ago: Serena MCP does this approach IIRC.
petesergeant wrote 6 hours 38 min ago: That's interesting, and I haven't, but as long as the IDE has an API for the refactoring action, giving an agent access to it as a tool should be pretty straightforward. Great idea.
clayliu wrote 8 hours 41 min ago: "They're still more like weird, overconfident interns." Perfect summary. LLMs can emit code fast but they don't really handle code like developers do -- there's no sense of spatial manipulation, no memory of where things live, no questions asked before moving stuff around. Until they can "copy-paste" both code and context with intent, they'll stay great at producing snippets and terrible at collaborating.
furyg3 wrote 8 hours 28 min ago: This is exactly how we describe them internally: the smartest interns in the world. I think it's because the chat box way of interacting with them is also similar to how you would talk to someone who just joined a team. "Hey, it wasn't what you asked me to do, but I went ahead and refactored this whole area over here while simultaneously screwing up the business logic, because I have no comprehension of how users use the tool." "Um, ok, but did you change the way notifications work like I asked?" "Yes." "Notifications don't work anymore." "I'll get right on it."
senko wrote 8 hours 41 min ago: I'd argue LLM coding agents are still bad at many more things. But to comment on the two problems raised in the post: > LLMs don't copy-paste (or cut and paste) code. The article is confusing the architectural layers of AI coding agents. It's easy to add "cut/copy/paste" tools to the AI system if that shows improvement. This has nothing to do with the LLM; it's in the layer on top. > Good human developers always pause to ask before making big changes or when they're unsure [LLMs] keep trying to make it work until they hit a wall -- and then they just keep banging their head against it. Agreed - LLMs don't know how to backtrack. The recent (past year) improvements in thinking/reasoning do help in this regard (it's the whole "but wait..." RL training that exploded with OpenAI o1/o3 and DeepSeek R1, now done by everyone), but clearly there's still work to do.
typpilol wrote 8 hours 26 min ago: Ask a model to show you the seahorse emoji and you'll get a storm of "but wait!"
bad_username wrote 8 hours 44 min ago: LLMs are great at asking questions if you ask them to ask questions. Try it: "before writing the code, ask me about anything that is nuclear or ambiguous about the task".
d1sxeyes wrote 8 hours 41 min ago: "If you think I'm asking you to split atoms, you're probably wrong".
SafeDusk wrote 8 hours 45 min ago: @kixpanganiban Do you think it will work if, for refactoring tasks, we take away OpenAI's `apply_patch` tool and just provide `cut` and `paste` for the first few steps? I can run this experiment using the ToolKami[0] framework if there is enough interest or if someone can give some insights. [0]: URI [1]: https://github.com/aperoc/toolkami
pammf wrote 8 hours 49 min ago: In Claude Code, it always shows the diff between current and proposed changes, and I have to explicitly allow it to actually modify the code. Doesn't that "fix" the copy-&-paste issue?
nxpnsv wrote 8 hours 54 min ago: Codex has got me a few times lately, doing what I asked but certainly not what I intended:
- Get rid of these warnings "...": captures and silences warnings instead of fixing them
- Update this unit test to reflect the changes "...": changes the code so the outdated test works
- The argument passed is now wrong: catches the exception instead of fixing the argument
My advice is to prefer small changes and read everything it does before accepting anything; often this means using the agent actually is slower than just coding...
d1sxeyes wrote 8 hours 25 min ago: You also have to be a bit careful: "Fix the issues causing these warnings". Retrospectively fixing a test to be passing given the current code is a complex task; instead, you can ask it to write a test that tests the intended behaviour, without needing to infer it. "The argument passed is now wrong" -- you're asking the LLM to infer that there's a problem somewhere else, and to find and fix it. When you're asking an LLM to do something, you have to be very explicit about what you want it to do.
nxpnsv wrote 1 hour 25 min ago: Exactly. I think the takeaway is that being careful when formulating a task is essential with LLMs. They make errors that wouldn't be expected when asking the same from a person.
cat-whisperer wrote 8 hours 54 min ago: The copy-paste thing is interesting because it hints at a deeper issue: LLMs don't have a concept of "identity" for code blocks -- they just regenerate from learned patterns. I've noticed similar vibes when agents refactor -- they'll confidently rewrite a chunk and introduce subtle bugs (formatting, whitespace, comments) that copy-paste would've preserved. The "no questions" problem feels more solvable with better prompting/tooling though, like explicitly rewarding clarification in RLHF.
stellalo wrote 8 hours 39 min ago: I feel like it's the opposite: the copy-paste issue is solvable, you just need to equip the model with the right tools and make sure they are trained on tasks where that's unambiguously the right thing to do (for example, cases where copying code "by hand" would be extremely error prone -> leads to lower reward on average). On the other hand, teaching the model to be unsure and ask questions requires the training loop to break and bring human input in, which appears more difficult to scale.
saghm wrote 8 hours 23 min ago: > On the other hand, teaching the model to be unsure and ask questions requires the training loop to break and bring human input in, which appears more difficult to scale. The ironic thing to me is that the one thing they never seem to be willing to skip asking about is whether they should proceed with some fix that I just helped them identify.
They seem extremely reluctant to actually ask about things they don't know about, but extremely eager to ask about whether they should do the things they have already decided they think are right!
juped wrote 9 hours 2 min ago: It's apparently lese-Copilot to suggest this these days, but you can find very good hypothesizing and problem solving if you talk conversationally to Claude or probably any of its friends that isn't the terminally personality-collapsed SlopGPT (with or without showing it code, or diagrams); it's actually what they're best at, and often they're even less likely than human interlocutors to just parrot some set phrase at you. It's only when you take the tech out of the area it's good at and start trying to get it to "write code" or even worse "be an agent" that it starts cracking up and emitting garbage; this is only done because companies want to forcememe some kind of product besides "chatbot", whether or not it makes sense. It's a shame, because it'll happily and effectively write the docs that don't exist but that you wish did for more or less anything. (Writing code examples for docs is not a weak point at all.)
rossant wrote 9 hours 3 min ago: Recently, I asked Codex CLI to refactor some HTML files. It didn't literally copy and paste snippets here and there as I would have done myself; it rewrote them from memory, removing comments in the process. There was a section with 40 successive links with complex URLs. A few days later, just before deployment to production, I wanted to double check all 40 links. First one worked. Second one worked. Third one worked. Fourth one worked. So far so good. Then I tried the last four. Perfect. Just to be sure, I proceeded with the fifth one. 404. Huh. Weird. The domain was correct though, and the URL seemed reasonable. I tried the other 31 links. ALL of them 404ed. I was totally confused. The domain was always correct. It seemed highly suspicious that all websites would have moved internal URLs at the same time. I didn't even remember that this part of the code had gone through an LLM. Fortunately, I could retrieve the old URLs from old git commits. I checked the URLs carefully. The LLM had HALLUCINATED most of the path part of the URLs! Replacing things like domain.com/this-article-is-about-foobar-123456/ by domain.com/foobar-is-so-great-162543/... These kinds of very subtle and silently introduced mistakes are quite dangerous. Be careful out there!
scottbez1 wrote 38 min ago: The last point I think is most important: "very subtle and silently introduced mistakes" -- LLMs may be able to complete many tasks as well as (or better than) humans, but that doesn't mean they complete them the same way, and that's critically important when considering failure modes. In particular, code review is one layer of the conventional swiss cheese model of preventing bugs, but code review becomes much less effective when suddenly the categories of errors to look out for change. When I review a PR with large code moves, it was historically relatively safe to assume that a block of code was moved as-is (sadly only an assumption, because GitHub still doesn't have indicators of duplicated/moved code like Phabricator had 10 years ago...), so I can focus my attention on higher level concerns, like does the new API design make sense?
But if an LLM did the refactor, I need to scrutinize every character that was touched in the block of code that was "moved" because, as the parent commenter points out, that "moved" code may have actually been ingested, summarized, then rewritten from scratch based on that summary. For this reason, I'm a big advocate of an "AI use" section in PR description templates; not because I care whether you used AI or not, but because some hints about where or how you used it will help me focus my efforts when reviewing your change, and tune the categories of errors I look out for.
FitchApps wrote 53 min ago: Reminds me of when I asked Claude (through Windsurf) to create an S3 Lambda trigger to resize images (as soon as a PNG image appears in S3, resize it). The code looked flawless and I deployed... only to learn that I had introduced a perpetual loop :) For every image resized, a new one would be created and resized. In 5 min, the trigger created hundreds of thousands of images... what a joy it was to clean that up in S3.
polynomial wrote 56 min ago: Evals don't fix this.
HardCodedBias wrote 53 min ago: Maybe they don't fix it, but I suspect that they move us towards it occurring less often.
dkarl wrote 1 hour 42 min ago: I've had similar experiences both in coding and in non-coding research questions. An LLM will do the first N right and fake its work on the rest. It even happens when asking an LLM to reformat a document, or asking it to do extra research to validate information. For example, before a recent trip to another city, I asked Gemini to prepare a list of brewery taprooms with certain information, and I discovered it had included locations that had been closed for years or had just been pop-ups. I asked it to add a link to the current hours for each taproom and remove locations that it couldn't verify were currently open, and it did this for about the first half of the list. For the last half, it made irrelevant changes to the entries and didn't remove any of the closed locations. Of course it enthusiastically reported that it had checked every location on the list.
Romario77 wrote 1 hour 15 min ago: LLMs are not good at "cycles" - when you have to go over a list and do the same action on each item. It's like they have ADHD and forget or get distracted in the middle. And the reason for that is that LLMs don't have memory and process the tokens, so as they keep going over the list the context becomes bigger with more irrelevant information, and they can lose the reason they are doing what they are doing.
fwip wrote 53 min ago: It would be nice if the tools we usually use for LLMs had a bit more programmability. In this example, we could imagine being able to chunk up work by processing a few items, then reverting to a previously saved LLM checkpoint of state, and repeating until the list is complete. I imagine that the cost of saving & loading the current state must be prohibitively high for this to be a normal pattern, though.
radarsat1 wrote 5 min ago: Agreed. You basically want an LLM to have a tool that writes its own agent to accomplish a repetitive task. I think this is doable.
dmoy wrote 55 min ago: Which is annoying, because that is precisely the kind of boring rote programming task I want an LLM to do for me, to free up my time for more interesting problems.
polynomial wrote 56 min ago: So much for Difference and Repetition.
mehdibl wrote 2 hours 5 min ago: Errors are normal and happen often. You need to focus on providing it the ability to test the changes and fix errors.
If you expect one shot, you will get a lot of bad surprises.
cpfohl wrote 2 hours 53 min ago: My custom prompt instructs GPT to output changes to code as a diff/git patch. I don't use agents because they make it hard to see what's happening, and I don't trust them yet.
ravila4 wrote 2 hours 22 min ago: I've tried this approach when working in chat interfaces (as opposed to IDEs), but I often find it tricky to review diffs without the full context of the codebase. That said, your comment made me realize I could be using "git apply" more effectively to review LLM-generated changes directly in my repo. It's actually a neat workflow!
yodsanklai wrote 3 hours 43 min ago: 5 minutes ago, I asked Claude to add some debug statements in my code. It also silently changed a regex in the code. It was easily caught with the diff, but it can be harder to spot in larger changes.
jihadjihad wrote 2 hours 29 min ago: I had a pretty long regex in a file that was old and crusty, and when I had Claude add a couple helpers to the file, it changed the formatting of the regex to be a little easier on the eyes in terms of readability. But I just couldn't trust it. The diff would have been no help since it went from one long gnarly line to 5 tight lines. I kept the crusty version since at least I am certain it works.
alzoid wrote 2 hours 41 min ago: I asked Claude to add a debug endpoint to my hardware device that just gave memory information. It wrote 2600 lines of C that gave information about every single aspect of the system. On the one hand, kind of cool. It looked at the MQTT code and the update code, the platform (ESP) and generated all kinds of code. It recommended platform settings that could enable more detailed information, which checked out when I looked at the docs. I ran it and it worked. On the other hand, most of the code was just duplicated over and over again, e.g. 3 different endpoints that gave overlapping information. About half of the code generated fake data rather than actually doing anything with the system. I rolled back and re-prompted and got something that looked good and worked. The LLMs are magic when they work well, but they can throw a wrench into your system that will cost you more if you don't catch it. I also just had a 'senior' developer tell me that a feature in one of our platforms was deprecated. This was after I saw their code, which did some wonky, hacky stuff to achieve something simple. I checked the docs and said feature (URL Rewriting) was obviously not deprecated. When I asked how they knew it was deprecated, they said ChatGPT told them. So now they are fixing the fix ChatGPT provided.
troupo wrote 1 hour 31 min ago: > About half of the code generated fake data rather than actually do anything with the system. All the time: "// fake data. in production this would be real data" ... and then it proceeds to write sometimes hundreds of lines of code to provide fake data.
stuartjohnson12 wrote 59 min ago: "hey claude, please remove the fake data and use the real data" "sure thing, I'll add logic to check if the real data exists and only use the fake data as a fallback in case the real data doesn't exist"
weakfish wrote 33 min ago: This comment captures exactly what aggravates me about CC / other agents in a way that I wasn't sure how to express before. Thanks!
alzoid wrote 44 min ago: I will also add checks to make sure the data that I get is there, even though I checked 8 times already, and provide loads of logging statements and error handling.
Then I will go to every client that calls this API and add the same checks and error handling with the same messaging. Oh, also, with all those checks I'm just going to swallow the error at the entry point so you don't even know it happened at runtime unless you check the logs. That will be $1.25, please.
smougel wrote 3 hours 45 min ago: Not related to code... But when I use an LLM to perform a kind of copy/paste, I try to number the lines and ask it to generate a start_index and stop_index to perform the slice operation. Far fewer hallucinations, and very cheap in token generation.
grafmax wrote 3 hours 53 min ago: Yeah, this sort of thing is a huge time waster with LLMs.
weinzierl wrote 4 hours 56 min ago: Not code, but I once pasted an event announcement and asked for just a spelling and grammar check. The LLM suggested a new version with minor tweaks, which I copy-pasted back. Just before sending, I noticed that it had moved the event date by one day. Luckily I caught it, but it taught me that you should never blindly trust LLM output, even with super simple tasks, no relevant context size, and a clear and simple one-sentence prompt. LLMs do the most amazing things, but they also sometimes screw up the simplest of tasks in the most unexpected ways.
nonethewiser wrote 1 hour 46 min ago: > Not code, but I once pasted an event announcement and asked for just a spelling and grammar check. The LLM suggested a new version with minor tweaks, which I copy-pasted back. Just before sending, I noticed that it had moved the event date by one day. This is the kind of thing I immediately noticed about LLMs when I used them for the first time. Just anecdotally, I'd say they had this problem 30-40% of the time. As time has gone on, it has gotten so much better. But they still make this kind of mistake -- let's just say -- 5% of the time. The thing is, it's almost more dangerous to rarely make the mistake, because now people aren't constantly looking for it. You have no idea if it's not just randomly flipping terms or injecting garbage unless you actually validate it. The idea of giving it an email to improve and then just scanning the result before firing it off is terrifying to me.
flowingfocus wrote 2 hours 29 min ago: A diff makes these kinds of errors much easier to catch. Or maybe someone from Xerox has a better idea how to catch subtly altered numbers?
mcpeepants wrote 2 hours 27 min ago: I verify all dates manually by memorizing their offset from the date of the signing of the Magna Carta.
Xss3 wrote 6 hours 28 min ago: This is a horror story about bad quality control practices, not the use of LLMs.
__MatrixMan__ wrote 3 hours 17 min ago: I have a project, for which I've leaned heavily on LLM help, that I consider to embody good quality control practices. I had to get pretty creative to pull it off: I spent a lot of time working on a sync system so that I can import sanitized production data into the project for every table it touches (there are maybe 500 of these), and then there's a bunch of hackery related to ensuring I can still get good test coverage even when some of these flows are partially specified (since adding new ones proceeds in several separate steps). If it was a project written by humans, I'd say they were crazy for going so hard on testing. The quality control practices you need for safely letting an LLM run amok aren't just good. They're extreme.
amelius wrote 7 hours 6 min ago: In these cases I explicitly tell the LLM to make as few changes as possible, and I also run a diff.
And then I reiterate with a new prompt if too many things changed.
globular-toast wrote 6 hours 41 min ago: You can always run a diff. But how good are people at reading diffs? Not very. It's the kind of thing you would probably want a computer to do. But now we've got the computer generating the diffs (which it's bad at) and humans verifying them (which they're also bad at).
CaptainOfCoit wrote 5 hours 3 min ago: Yeah, pick one for you to do, the other for the LLMs to do; ideally pick the one you're better at, otherwise 50/50 you'll actually become faster.
coldtea wrote 8 hours 27 min ago: > A few days later, just before deployment to production, I wanted to double check all 40 links. This was allowed to go to master without "git diff" after Codex was done?
raffael_de wrote 7 hours 31 min ago: This, and why are the URLs hardcoded to begin with? And given the chaotic rewrite by Codex, it would probably be more work to untangle the diff than to just do it yourself right away.
rossant wrote 8 hours 23 min ago: It was a fairly big refactoring, basically converting a working static HTML landing page into a Hugo website, splitting the HTML into multiple Hugo templates. I admit I was quite in a hurry and had to take shortcuts. I didn't have time to write automated tests and had to rely on manual tests for this single webpage. The diff was fairly big. It just didn't occur to me that the URLs would go through the LLM and could be affected! Lesson learnt, haha.
indigodaddy wrote 32 min ago: This is why my instinct for this sort of task is "write a script that I can use to do x y z," instead of "do x y z".
cimi_ wrote 7 hours 19 min ago: Speaking of agents and tests, here's a fun one I had the other day: while refactoring a large code base, I told the agent to do something precise to a specific module, refactor with the new change, then ensure the tests are passing. The test suite is slow and has many moving parts; the tests I asked it to run take ~5 minutes. The thing decided to kill the test run, then it made up another command it said was the 'tests', so when I looked at the agent console in the IDE everything seemed fine when collapsed, i.e. 'Tests ran successfully'. Obviously the code changes also had a subtle bug that I only saw when pushing its refactoring to CI (and more waiting). At least there were tests to catch the problem.
rossant wrote 3 hours 26 min ago: So it took a shortcut because it was too lazy, and it lied to your face about it. AGI is here for good.
tuesdaynight wrote 3 hours 40 min ago: I think it's something that model providers don't want to fix, because the number of times that Claude Code just decided to delete tests that were not passing, before I added a memory saying that it would need to ask for my permission to do that, was staggering. It stopped happening after the memory, so I believe that it could be easily fixed by a system prompt.
Ezhik wrote 3 hours 19 min ago: Your Claude Code actually respects CLAUDE.md?
exe34 wrote 8 hours 2 min ago: This is why I'm terrified of large LLM slop changesets that I can't check side by side -- but then that means I end up doing many small changes that are harder to describe in words than to just outright do.
hshdhdhehd wrote 8 hours 29 min ago: Well, using an LLM is like rolling dice. Logits are probabilities. It is a bullshit machine.
dude250711 wrote 6 hours 47 min ago: Yeah, it read like "when running with scissors be careful out there". How about not running with scissors at all?
Unless of course the management says "from now on you will be running with scissors and your performance will increase as a result".
hansmayer wrote 2 hours 45 min ago: And if you stab yourself in the stomach... you must have sucked at running with the scissors :)
worldsayshi wrote 8 hours 32 min ago: This is of course bad, but: humans also make (different) mistakes all the time. We could account for the risk of mistakes being introduced and make more tools that validate things for us. In a way, LLMs encourage us to do this by adding other vectors of chaos into our work. Like, why not have tools built into our environment that check that links are not broken? With the right architecture we could have validations for most common mistakes without the solution adding a bunch of tedious overhead.
lenkite wrote 6 hours 37 min ago: In the kind of situation described above, a meticulous coder actually makes no mistakes. They will, however, make a LOT more mistakes if they use LLMs to do the same. I have already had to correct a LOT of crap similar to the above in refactoring-done-via-LLM over the last year. When stuff like this was done by a plain, slow, organic human, it was far more accurate. And many times, completely accurate with no defects. Simply because many developers pay close attention when they are forced to do the manual labour themselves. Sure, the refactoring commit is produced faster with LLM assistance, but repeatedly reviewing code and pointing out weird defects is very stressful.
worldsayshi wrote 3 hours 13 min ago: I think it goes without saying that we need to be sceptical about when to use and not use LLMs. The point I'm trying to make is more that we should have more validations, not that we should be less sceptical about LLMs. Meticulousness shouldn't be an excuse not to have layers of validation, which don't have to cost that much if done well.
thunky wrote 3 hours 24 min ago: > I have already had to correct a LOT of crap similar to the above in refactoring-done-via-LLM over the last year The person using the LLM should be reviewing their code before submitting it to you for review. If you can catch a copy-paste error like this, then so should they. The failure you're describing is that your coworkers are not doing their job. And if you accept "the LLM did that, not me" as an excuse, then the failure is on you and it will keep happening.
mr_mitm wrote 5 hours 5 min ago: A meticulous coder probably wouldn't have typed out 40 URLs just because they want to move them from one file to another. They would copy-paste them and run some sed-like commands. You could instruct an LLM agent to do something similar. For modifying a lot of files or a lot of lines, I instruct them to write a script that does what I need instead of telling them to do it themselves.
cimi_ wrote 7 hours 30 min ago: Your point to not rely on good intentions and to have systems in place to ensure quality is good -- but your comparison to humans didn't sit well with me. Very few humans fill in their task with made up crap and then lie about it -- I haven't met any in person. And if I did, I wouldn't want to work with them, even if they worked 24/7. Obligatory disclaimer for future employers: I believe in AI, I use it, yada yada. The reason I'm commenting here is that I don't believe we should normalise this standard of quality for production work.
exe34 wrote 8 hours 1 min ago: > that checks that links are not broken? Can you spot the next problem introduced by this?
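For reference, the kind of link check being discussed is only a few lines with the Python standard library; the sketch below only catches dead links, not the subtler failure of URLs that were silently rewritten but still resolve:

    import re
    import sys
    import urllib.request

    def broken_links(html_path: str) -> int:
        """Count links in an HTML file that fail to load."""
        html = open(html_path, encoding="utf-8").read()
        broken = 0
        for url in re.findall(r'href="(https?://[^"]+)"', html):
            # Some servers reject HEAD; a GET fallback could be added if needed.
            req = urllib.request.Request(url, method="HEAD")
            try:
                urllib.request.urlopen(req, timeout=10)
            except Exception as err:
                print(f"BROKEN {url}: {err}")
                broken += 1
        return broken

    if __name__ == "__main__":
        sys.exit(1 if broken_links(sys.argv[1]) else 0)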
rossant wrote 8 hours 27 min ago: I agree, these kinds of stories should encourage us to set up more robust testing/backup/check strategies. Like you would absolutely have to do if you suddenly invited a bunch of inexperienced interns to edit your production code. rullelito wrote 8 hours 28 min ago: LLMs are turning into LLMs + hard-coded fixes for every imaginable problem. ivape wrote 8 hours 47 min ago: You're just not using LLMs enough. You can never trust the LLM to generate a url, and this was known over two years ago. It takes one token hallucination to fuck up a url. It's very good at a fuzzy great answer, not a precise one. You have to really use this thing all the time and pick up on stuff like that. fwip wrote 11 min ago: I think part of the issue is that it doesn't "feel" like the LLM is generating a URL, because that's not what a human would be doing. A human would be cut & pasting the URLs, or editing the code around them - not retyping them from scratch. Edit: I think I'm just regurgitating the article here. jollyllama wrote 3 hours 39 min ago: > You're just not using LLMs enough. > You can never trust the LLM to generate a url This is very poorly worded. Using LLMs more wouldn't solve the problem. What you're really saying is that the GP is uninformed about LLMs. This may seem like pedantry on my part but I'm sick of hearing "you're doing it wrong" when the real answer is "this tool can't do that." The former is categorically different than the latter. IanCal wrote 3 hours 17 min ago: It's pretty clearly worded to me: they don't use LLMs enough to know how to use them successfully. If you use them regularly you wouldn't see a set of urls without thinking "Unless these are extremely obvious links to major sites, I will assume each is definitely wrong". > I'm sick of hearing "you're doing it wrong" That's not what they said. They didn't say to use LLMs more for this problem. The only people that should take the wrong meaning from this are ones who didn't read past the first sentence. > when the real answer is "this tool can't do that." That is what they said. jollyllama wrote 2 hours 52 min ago: > If you use them regularly you wouldn't see a set of urls without thinking... Sure, but conceivably, you could also be informed of this second hand, through any publication about LLMs, so it is very odd to say "you don't use them enough" rather than "you're ignorant" or "you're uninformed". It is very similar to these very bizarre AI-maximalist positions that so many of us are tired of seeing. IanCal wrote 2 hours 43 min ago: This isn't AI-maximalist though, it's explicitly pointing out something that regularly does not work! > Sure, but conceivably, you could also be informed of this second hand, through any publication about LLMs, so it is very odd to say "you don't use them enough" rather than "you're ignorant" or "you're uninformed". But this is to someone who is actively using them, and the suggestion of "if you were using them more actively you'd know this, this is a very common issue" is not at all weird. There are other ways they could have known this, but they didn't. "You haven't got the experience yet" is a much milder way of saying someone doesn't know how to use a tool properly than "you're ignorant".
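A link check also won't catch a URL that was silently swapped for a different but still-live one, which may be the problem exe34 is hinting at above. A stricter guard for the refactoring story that started this thread is to compare the exact multiset of URLs before and after the agent's change, so nothing the model "retyped" can drift. A rough sketch, again assuming Python and only the standard library; the directory arguments and regex are illustrative:

# diff_urls.py - verify a refactor preserved every hardcoded URL byte for byte.
import re
import sys
from collections import Counter
from pathlib import Path

URL_PATTERN = re.compile(r'https?://[^\s"\'<>)]+')

def urls_in_tree(root):
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            try:
                counts.update(URL_PATTERN.findall(path.read_text(encoding="utf-8")))
            except (UnicodeDecodeError, OSError):
                continue  # skip binary or unreadable files
    return counts

if __name__ == "__main__":
    before, after = urls_in_tree(sys.argv[1]), urls_in_tree(sys.argv[2])
    missing = before - after   # URLs lost or altered by the change
    added = after - before     # URLs that did not exist before
    for url in missing:
        print("MISSING:", url)
    for url in added:
        print("NEW:", url)
    sys.exit(1 if (missing or added) else 0)

Pointing it at a pre-change checkout and the working tree (e.g. python diff_urls.py old/ new/) turns "double check all 40 links" into a one-line command.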
hansmayer wrote 8 hours 13 min ago: Yeah so, the reason people use various tools and machines in the first place is to simplify work or everyday tasks by: 1) Making the tasks execute faster 2) Getting more reliable outputs than doing it yourself 3) Making it repeatable. The LLMs obviously don't check any of these boxes, so why don't we stop pretending that we as users are stupid and don't know how to use them and start taking them for what they are - cute little mirages, perhaps applicable as toys of some sort, but not something we should use for serious engineering work really? IanCal wrote 3 hours 20 min ago: They easily check a bunch of those boxes. > why don't we stop pretending that we as users are stupid and don't know how to use them This is in response to someone who saw a bunch of URLs coming out of it and was surprised at a bunch of them being wrong. That's using the tool wrong. It's like being surprised that the top results in google/app store/play store aren't necessarily the best match for your query but actually adverts! hitarpetar wrote 22 min ago: it's amazing that you picked another dark pattern as your comparison mbesto wrote 32 min ago: > This is in response to someone who saw a bunch of URLs coming out of it and was surprised at a bunch of them being wrong. That's using the tool wrong. It's like being surprised that the top results in google/app store/play store aren't necessarily the best match for your query but actually adverts! The CEO of Anthropic said I can fire all of my developers soon. How could one possibly be using the tool wrong? /s IanCal wrote 8 min ago: If you base all your tech workings on the promises of CEOs you'll fail badly; you should not be surprised by this. hansmayer wrote 2 hours 51 min ago: The URLs being wrong in that specific case is one where they were using the "wrong tool". I can name you at least a dozen other cases from my own experience where, too, they appear to be the wrong tool, for example for working with Terraform or for not exposing secrets by hardcoding them in the frontend. Et cetera. Many other people will have contributed thousands if not more similar but different cases. So what good are these tools for, really? Are we all really that stupid? Many of us mastered the hard problem of navigating various abstraction layers of computers over the years, only to be told we now effing don't know how to write a few sentences in English? Come on. I'd be happy to use them in whatever specific domain they supposedly excel at. But no-one seems to be able to identify one for sure. The problem is, the folks pushing, or better said, shoving these bullshit generators down our throats are trying to sell us the promise of an "everything oracle". What did old man Altman tell us about ChatGPT 5? PhD-level tool for code generation or some similar nonsense? But it turns out it only gets one metric right each time - generating a lot of text. So, essentially, great for bullshit jobs (I count some of the IT jobs as such too), but not much more. IanCal wrote 2 hours 18 min ago: > Many of us mastered the hard problem of navigating various abstraction layers of computers over the years, only to be told we now effing don't know how to write a few sentences in English? Come on. If you're trying to one-shot stuff with a few sentences then yes, you might be using these things wrong. I've seen people with PhDs fail to use google successfully to find things; were they idiots?
If you're using them wrong you're using them wrong - I don't care how smart you are in other areas. If you can't hand off work knowing someone's capabilities then that's a thing you can't do - and that's ok. I've known unbelievably good engineers who couldn't form a solid plan to solve a business problem or collaboratively work to get something done to save their life. Those are different skills. But gpt5-codex and sonnet 4 / 4.5 can solidly write code, gpt-5-pro with web search can really dig into things, and if you can manage what they can do you can hand off work to them. If you've only ever worked with juniors with a feeling of "they slow everything down but maybe someday they'll be as useful as me" then you're less likely to succeed at this. Let's do a quick overview of recent chats for me: * Identifying and validating a race condition in some code * Generating several approaches to a streaming issue, providing cost analyses of external services and complexity of 3 different approaches about how much they'd change the code * Identifying an async bug two good engineers couldn't find in a codebase they knew well * Finding performance issues that had gone unnoticed * Digging through synapse documentation and github issues to find a specific performance-related issue * Finding the right MSC for a feature I wanted to use but didn't know existed - and then finding the github issue that explained how it was only half implemented and how to enable the experimental other part I needed * Building a bunch of UI stuff for a short-term contract I needed, saving me a bunch of hours and the client money * Going through funding opportunities and matching them against a charity I want to help in my local area * Building a search integration for my local library to handle my kids' reading challenge * Solving a series of VPN issues I didn't understand * Writing a lot of astro-related Python for an art project to cover the loss of some NASA images I used to have access to. > the folks pushing, or better said If you don't want to trust them, don't. Also don't believe the anti-hype merchants who want to smugly say these tools can't do a god damn thing. They're trying to get attention as well. hansmayer wrote 1 hour 56 min ago: Again mate, stop making arrogant assumptions and read some of my previous comments. I and my team have been early adopters for about two years. I am even paying for premium-level service. Trust me, it sucks and under-delivers. But good for you and others who claim they are productive with it - I am sure we will see those 10x apps rolling in soon, right? It's only been like 4 years since the revolutionary magic machine was announced. IanCal wrote 1 hour 31 min ago: I read your comments. Did you read mine? You can pass them into chatgpt or claude or whatever premium services you pay for to summarise them for you if you want. > Trust me, it sucks Ok. I'm convinced. > and under-delivers. Compared to what promise? > I am sure we will see those 10x apps rolling in soon, right? Did I argue that? If you want to look at some massive improvements, I was able to put up UIs to share results & explore them with a client within minutes rather than it taking me a few hours (which from experience it would have done). > It's only been like 4 years since the revolutionary magic machine was announced. It's been less than 3 since chatgpt launched, which if you'd been in the AI sphere as long as I have (my god it's 20 years now) absolutely was revolutionary.
Over the last 4 years we've gone from gpt3 solving a bunch of NLP problems immediately, as long as you didn't care about cost, to gpt-5-pro with web search and codex/sonnet being able to explore a moderately sized codebase and make real and actual changes (running tests and following up with changes). Given how long I spent stopping a robot hitting the table because it shifted a bit and its background segmentation messed up, or fiddling with classifiers for text, the idea I can get a summary from input without training is already impressive, and then to be able to say "make it less wanky" and have it remove the corp speak is a huge shift in the field. If your measure of success is "the CEOs of the biggest tech orgs say it'll do this soon and I found a problem" then you'll be permanently disappointed. It'd be like me sitting here saying mobile phones are useless because I was told how revolutionary the new chip in an iphone was in a keynote. Since you don't seem to want to read most of this, most isn't for you. The last bit is, and it's just one question: Why are you paying for something that solves literally no problems for you? grey-area wrote 8 hours 42 min ago: Or just not bother. It sounds pretty useless if it flunks on basic tasks like this. Perhaps you've been sold a lie? seanw265 wrote 1 hour 19 min ago: I suspect you haven't tried a modern mid-to-large-LLM & agent pair for writing code. They're quite capable, even if not suited for all tasks. IanCal wrote 3 hours 12 min ago: They're moderately unreliable text copying machines if you need exact copying of long arbitrary strings. If that's what you want, don't use LLMs. I don't think they were ever really sold as that, and we have better tools for that. On the other hand, I've had them easily build useful code, answer questions and debug issues complex enough to escape good engineers for at least several hours. Depends what you want. They're also bad (for computers) at complex arithmetic off the bat, but then again we have calculators. goalieca wrote 2 hours 15 min ago: > I don't think they were ever really sold as that, and we have better tools for that. We have OpenAI describing gpt5 as having PhD-level intelligence and others like Anthropic saying it will write all our code within months. Some are claiming it's already writing 70%. I say they are being sold as a magical do-everything tool. IanCal wrote 36 min ago: Intelligence isn't the same as "can exactly replicate text". I'm hopefully smarter than a calculator but it's more reliable at maths than me. Also there's a huge gulf between "some people claim it can do X" and "it's useful". Altman promising something new doesn't decrease the usefulness of a model. hitarpetar wrote 20 min ago: saddest goalpost ever mbesto wrote 26 min ago: What you are describing is "dead reasoning zones".[1] "This isn't how humans work. Einstein never saw ARC grids, but he'd solve them instantly. Not because of prior knowledge, but because humans have consistent reasoning that transfers across domains. A logical economist becomes a logical programmer when they learn to code. They don't suddenly forget how to be consistent or deduce. But LLMs have "dead reasoning zones" - areas in their weights where logic doesn't work. Humans have dead knowledge zones (things we don't know), but not dead reasoning zones. Asking questions outside the training distribution is almost like an adversarial attack on the model." URI [1]: https://jeremyberman.substack.com/p/how-i-got-the-...
buildbot wrote 1 hour 2 min ago: Would you hire a PhD to copy URLs by hand? Would them having a PhD make it less likely they'd make a mistake than a high school student doing the same? hitarpetar wrote 5 min ago: I would not hire anyone for a role that requires computer use who does not know how to use copy/paste parineum wrote 34 min ago: A high school student would use copy/paste and the urls would be perfect duplicates. IanCal wrote 7 min ago: > A high school student would use copy/paste and the urls would be perfect duplicates. Did the LLM have this? goalieca wrote 34 min ago: Grad students and even post docs often do a lot of this manual labour for data entry and formatting. Been there, done that. IanCal wrote 6 min ago: Manual data entry has lots of errors. All good workflows around this base themselves on this fact. ivape wrote 8 hours 37 min ago: Well, you see it hallucinates on long precise strings, but if we ignore that, and focus on what it's powerful at, we can do something powerful. In this case, by the time it gets to outputting the url, it already determined the correct intent or next action (print out a url). You use this intent to do a tool call to generate a url. Small aside, its ability to figure out the what and why is pure magic, for those still peddling the glorified autocomplete narrative. You have to be able to see what this thing can actually do, as opposed to what it can't. sebtron wrote 8 hours 4 min ago: > Well, you see it hallucinates on long precise strings But all code is "long precise strings". ogogmad wrote 5 hours 12 min ago: He obviously means random unstructured strings, which code is usually not. doikor wrote 8 hours 45 min ago: I would generalise it to: you can't trust LLMs to generate any kind of unique identifier. Sooner or later it will hallucinate a fake one. wat10000 wrote 1 hour 6 min ago: I would generalize it further: you can't trust LLMs. They're useful, but you must verify anything you get from them. sidgtm wrote 9 hours 4 min ago: As a UX designer I see they lack the ability to be opinionated about a design piece and just go with the standard mental model. I got fed up with this and made some simple JavaScript code to run a simple canvas on localhost to pass on more subjective feedback using a highlights and notes feature. I tried using playwright first but a. it's token heavy b. it's still for finding what's working or breaking instead of thinking deeply about the design. seunosewa wrote 8 hours 15 min ago: What do the notes look like? sidgtm wrote 7 hours 39 min ago: specific inputs, e.g. move, color change, or giving specific inputs for the interaction piece. freetonik wrote 9 hours 5 min ago: I see a pattern in these discussions all the time: some people say how very, very good LLMs are, and others say how LLMs fail miserably; almost always the first group presents examples of simple CRUD apps, frontend "represent data using some JS-framework" kind of tasks, while the second group presents examples of non-trivial refactoring, stuff like parsers (in this thread), algorithms that can't be found in leetcode, etc. Tech twitter keeps showing "one-shotting full-stack apps" or "games", and it's always something extremely banal. It's impressive that a computer can do it on its own, don't get me wrong, but it was trivial to programmers, and now it is commoditized. regularfry wrote 7 hours 45 min ago: The function of technological progress, looked at through one lens, is to commoditise what was previously bespoke. LLMs have expanded the set of repeatable things.
What we're seeing is people on the one hand saying "there's huge value in reducing the cost of producing rote assets", and on the other "there is no value in trying to apply these tools to tasks that aren't repeatable". Both are right. NitpickLawyer wrote 8 hours 47 min ago: > almost always the first group presents examples of simple CRUD apps How about a full programming language written by cc "in a loop" in ~3 months? With a compiler and stuff? [1] It might be a meme project, but it's still impressive as hell we're here. I learned about this from a yt content creator who took that repo, asked cc to "make it so that variables can be emojis", and cc did that $5 later. Pretty cool. URI [1]: https://cursed-lang.org/ Gazoche wrote 2 hours 19 min ago: > written by cc "in a loop" in ~3 months? What does that mean exactly? I assume the LLM was not left alone with its task for 3 months without human supervision. NitpickLawyer wrote 2 hours 11 min ago: From the FAQ: > the following prompt was issued into a coding agent: > Hey, can you make me a programming language like Golang but all the lexical keywords are swapped so they're Gen Z slang? > and then the coding agent was left running AFK for months in a bash loop sarchertech wrote 11 min ago: I don't buy it at all. Not even Anthropic or OpenAI have come anywhere close to something like this. Running for 3 months and generating a working project this large with no human intervention is so far outside of the capabilities of any agent/LLM system demonstrated by anyone else that the most likely explanation is that the promoter is lying about it running on its own for 3 months. I looked through the videos listed as "facts" to support the claims and I don't see anything longer than a few hours. freetonik wrote 8 hours 18 min ago: Ok, not trivial for sure, but not novel? IIUC, the language does not have really new concepts, apart from the keywords (which is trivial). Impressive nonetheless. NitpickLawyer wrote 8 hours 2 min ago: Novel as in never done before? Of course not. Novel as in "an LLM can maintain coherence on a 100k+ LoC project written in zig"? Yeah, that's absolutely novel in this space. This wasn't possible 1 year ago. And this was fantasy 2.5 years ago when chatgpt launched. Also impressive in that cc "drove" this from a simple prompt. Also impressive that cc can do stuff in this 1M+ (lots of js in the extensions folders?) repo. Lots of people claim LLMs are useless in high-LoC repos. The fact that cc could navigate a "new" language and make "variables as emojis" work is again novel (i.e. couldn't be done 1 year ago) and impressive. freetonik wrote 7 hours 3 min ago: >Novel as in "an LLM can maintain coherence on a 100k+ LoC project written in zig"? Yeah, that's absolutely novel in this space. Absolutely. I do not underestimate this. quietbritishjim wrote 8 hours 58 min ago: Yesterday, I got Claude Code to make a script that tried out different point clustering algorithms and visualised them. It made the odd mistake, which it then corrected with help, but broadly speaking it was amazing. It would've taken me at least a week to write by hand, maybe longer. It was writing the algorithms itself, definitely not just simple CRUD stuff. dncornholio wrote 1 hour 45 min ago: That's actually a very specific domain, which is well documented and researched, in which LLMs will always do well. Shit will hit the fan quickly when you're going to do integration where it won't have a specific problem domain.
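For a sense of scale, the script quietbritishjim describes is also the kind of task whose output can be verified at a glance. A rough sketch of what "try a few clustering algorithms and visualise them" might look like, assuming scikit-learn and matplotlib are installed; the dataset, algorithm choices, and parameters are arbitrary illustrations, not the commenter's actual code:

# cluster_compare.py - run a few clustering algorithms on the same 2D points
# and plot the results side by side.
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN, KMeans, AgglomerativeClustering
from sklearn.datasets import make_blobs

points, _ = make_blobs(n_samples=500, centers=4, cluster_std=1.2, random_state=0)

algorithms = {
    "KMeans (k=4)": KMeans(n_clusters=4, n_init=10, random_state=0),
    "DBSCAN (eps=0.9)": DBSCAN(eps=0.9, min_samples=10),
    "Agglomerative (k=4)": AgglomerativeClustering(n_clusters=4),
}

fig, axes = plt.subplots(1, len(algorithms), figsize=(12, 4))
for ax, (name, algo) in zip(axes, algorithms.items()):
    labels = algo.fit_predict(points)  # cluster label per point
    ax.scatter(points[:, 0], points[:, 1], c=labels, s=10, cmap="tab10")
    ax.set_title(name)
plt.tight_layout()
plt.show()

A plot like this makes "the odd mistake" easy to spot visually, which is part of why this kind of task plays to an agent's strengths.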
fwip wrote 7 min ago: Yep - visualizing clustering algorithms is just the "CRUD app" of a different speciality. One rule of thumb I use is: if you could expect to find a student on a college campus to do a task for you, an LLM will probably be able to do a decent job. My thinking is that this is because we have a lot of teaching resources available for how to do that task, which the training has of course ingested. an0malous wrote 2 hours 39 min ago: Let's see the diff piva00 wrote 8 hours 36 min ago: In my experience it's been great to have LLMs for narrowly-scoped tasks, things I know how I'd implement (or at least start implementing) but that would be tedious to do manually; prompting it with increasingly higher complexity does work better than I expected for these narrow tasks. Whenever I've attempted to actually do the whole "agentic coding" by giving it a complex task, breaking it down into sub-tasks, loading up context, reworking the plan file when something goes awry, trying again, etc., it hasn't a single fucking time done the thing it was supposed to do to completion, requiring a lot of manual reviewing, backtracking, nudging; it becomes more exhausting than just doing most of the work myself, and pushing the LLM to do the tedious work. It does work sometimes to use it for analysis, asking it to suggest changes with the reasoning but not implement them, since most times when I let it try to implement its broad suggestions it went haywire, requiring me to pull back, and restart. There's a fine line to walk, and I only see comments on the extremes online; it's either "I let 80 agents running and they build my whole company's code" or "they fail miserably on every task harder than a CRUD". I tend to not believe in either extreme, at least not for the kinds of projects I work on, which require more context than I could ever fit properly beforehand into these robots. freetonik wrote 8 hours 53 min ago: I also got good results for "above CRUD" stuff occasionally. Sorry if I wasn't clear, I meant to primarily share an observation about vastly different responses in discussions related to LLMs. I don't believe LLMs are completely useless for non-trivial stuff, nor do I believe that they won't get better. Even those two problems in the linked article: sure, those actions are inherently alien to the LLM's structure itself, but can be solved with augmentation. ziotom78 wrote 9 hours 11 min ago: I fully resonate with point #2. A few days ago, I was stuck trying to implement some feature in a C++ library, so I used ChatGPT for brainstorming. ChatGPT proposed a few ideas, all apparently reasonable, and then it advocated for one that was presented unambiguously as the "best". After a few iterations, I realized that its solution would have required a class hierarchy where the base class contained a templated virtual function, which is not allowed in C++. I pointed this out to ChatGPT and asked it to rethink the solution; it then immediately advocated for the other approach it had initially suggested. sxp wrote 9 hours 13 min ago: Another place where LLMs have a problem is when you ask them to do something that can't be done via duct-taping a bunch of Stack Overflow posts together. E.g., I've been vibe coding in Typescript on Deno recently. For various reasons, I didn't want to use the standard Express + Node stack, which is what most LLMs seem to prefer for web apps. So I ran into issues with Replit and Gemini failing to handle the subtle differences between node and deno when it comes to serving HTTP requests.
LLMs also have trouble figuring out that a task is impossible. I wanted boilerplate code that rendered a mesh in Three.js using GL_TRIANGLE_STRIP because I was writing a custom shader and needed to experiment with the math. But Three.js does support GL_TRIANGLE_STRIP rendering for architectural reasons. Grok, ChatGPT, and Gemini all hallucinated a GL_TRIANGLE_STRIP rendering API rather than telling me about this, and I had to Google the problem myself. It feels like current coding LLMs are good at replacing junior engineers when it comes to shallow but broad tasks like creating UIs, modifying examples available on the web, etc. But they fail at senior-level tasks like realizing that the requirements being asked of them aren't valid and doing something that no one has done in their corpus of training data. athrowaway3z wrote 9 hours 9 min ago: >But Three.js does support GL_TRIANGLE_STRIP rendering for architectural reasons. Typo or trolling the next LLM to index HN comments? the_mitsuhiko wrote 9 hours 17 min ago: > LLMs don't copy-paste (or cut and paste) code. For instance, when you ask them to refactor a big file into smaller ones, they'll "remember" a block or slice of code, use a delete tool on the old file, and then a write tool to spit out the extracted code from memory. There are no real cut or paste tools. Every tweak is just them emitting write commands from memory. This feels weird because, as humans, we lean on copy-paste all the time. There is not that much copy/paste that happens as part of refactoring, so it leans on just using context recall. It's not entirely clear if providing an actual copy/paste command is particularly useful; at least from my testing it does not do much. More interesting are repetitive changes that clog up the context. Those you can improve on if you have `fastmod` or some similar tool available: you can instruct codex or claude to perform the edits with it. > And it's not just how they handle code movement -- their whole approach to problem-solving feels alien too. It is, but if you go back and forth to work out a plan for how to solve the problem, then the approach greatly changes. brianpan wrote 8 hours 54 min ago: How is it not clear that it would be beneficial? To use another example, with my IDE I can change a signature or rename something across multiple files basically instantly. But an LLM agent will take multiple minutes to do the same thing and doesn't get it right. the_mitsuhiko wrote 8 hours 35 min ago: > How is it not clear that it would be beneficial? There is reinforcement learning on the Anthropic side for a text edit tool, which is built in a way that does not lend itself to copy/paste. If you use a model like the GPT series then there might not be reinforcement learning for text editing (I believe, I don't really know), but it operates on line-based replacements for the most part, and for it to understand what to manipulate it needs to know the content in the context. When you try to give it a copy/paste buffer it does not fully comprehend what the change in the file looks like after the operation. So it might be possible to do something with copy/paste, but I did not find it to be very obvious how you make that work with an agent, given that it needs to read the file into context anyways and its recall capabilities are surprisingly good. > To use another example, with my IDE I can change a signature or rename something across multiple files basically instantly.
So yeah, that's the more interesting case, and there things like codemod/fastmod are very effective if you tell an agent to use them. They just don't reach for them on their own. 3abiton wrote 9 hours 14 min ago: I think copy/paste can alleviate context explosion. Basically the model can know what the code block contains and access it at any time, without needing to "remember" it. giancarlostoro wrote 9 hours 18 min ago: Point #2 cracks me up because I do see with JetBrains AI (no fault of JetBrains mind you) the model updates the file, and sometimes I somehow wind up with like a few build errors, or other times like 90% of the file is now build errors. Hey what? Did you not run some sort of what-if? throw-10-8 wrote 9 hours 20 min ago: 3. Saying no: LLMs will gladly go along with bad ideas that any reasonable dev would shoot down. pimeys wrote 9 hours 1 min ago: I've found codex to be better here than Claude. It has stopped many times and said hey, you might be wrong. Of course this changes with a larger context. Claude is just chirping away "You're absolutely right" and making me turn on caps lock when I talk to it, and it's not even noon yet. throw-10-8 wrote 8 hours 55 min ago: I find the chirpy affirmative tone of claude to be rage-inducing pimeys wrote 8 hours 43 min ago: This. The biggest reason I went with OpenAI this month... throw-10-8 wrote 4 hours 23 min ago: My "favorite" is when it makes a mistake and then tries to gaslight you into thinking it was your mistake and then confidently presents another incorrect solution. All while having the tone of an over-caffeinated intern who has only ever read Medium articles. nxpnsv wrote 9 hours 4 min ago: Agree, this is really bad. throw-10-8 wrote 9 hours 1 min ago: It's a fundamental failing of trying to use a statistical approximation of human language to generate code. You can't fix it. Vipsy wrote 9 hours 20 min ago: Coding agents tend to assume that the development environment is static and predictable, but real codebases are full of subtle, moving parts - tooling versions, custom scripts, CI quirks, and non-standard file layouts. Many agents break down not because the code is too complex, but because invisible, "boring" infrastructure details trip them up. Human developers subconsciously navigate these pitfalls using tribal memory and accumulated hacks, but agents bluff through them until confronted by an edge case. This is why even trivial tasks intermittently fail with automation agents: you're fighting not logic errors, but mismatches with the real lived context. Upgrading this context-awareness would be a genuine step change. pimeys wrote 9 hours 5 min ago: Yep. One of the things I've found agents always have a lot of trouble with is anything related to OpenTelemetry. There's a thing you call that uses some global somewhere, there's a docker container or two, and there are the timing issues. It takes multiple tries to get anything right. Of course this is hard for a human too if you haven't used otel before... tjansen wrote 9 hours 22 min ago: Agreed with the points in that article, but IMHO the no. 1 issue is that agents only see a fraction of the code repository. They don't know whether there is a helper function they could use, so they re-implement it. When contributing to UIs, they can't check the whole UI to identify common design patterns, so they re-invent it. The most important task for the human using the agent is to provide the right context.
"Look at this file for helper functions", "do it like that implementation", "read this doc to understand how to do it"... you can get very far with agents when you provide them with the right context. (BTW another issue is that they have problems navigating the directory structure in a large mono repo. When the agents needs to run commands like 'npm test' in a sub-directory, they almost never get it right the first time) bunderbunder wrote 1 hour 40 min ago: This is what I keep running into. Earlier this week I did a code review of about new lines of code, written using Cursor, to implement a feature from scratch, and I'd say maybe 200 of those lines were really necessary. But, y'know what? I approved it. Because hunting down the existing functions it should have used in our utility library would have taken me all day. 5 years ago I would have taken the time because a PR like that would have been submitted by a new team member who didn't know the codebase well, and helping to onboard new team members is an important part of the job. But when it's a staff engineer using Cursor to fill our codebase with bloat because that's how management decided we should work, there's no point. The LLM won't learn anything and will just do the same thing over again next week, and the staff engineer already knows better but is being paid to pretend they don't. ahi wrote 39 min ago: I really really hate code review now. My colleagues will have their LLMs generate thousands of lines of boiler plate with every pattern and abstraction under the sun. A lazy programmer use to do the bare minimum and write not enough code. That made review easy. Error handling here, duplicate code there, descriptive naming here, and so on. Now a lazy programmer generates a crap load of code cribbed from "best practice" tutorials, much of it unnecessary and irrelevant for the actual task at hand. tjansen wrote 1 hour 3 min ago: >>because that's how management decided we should work, there's no point If you are personally invested, there would be a point. At least if you plan to maintain that code for a few more years. Let's say you have a common CSS file, where you define .warning {color: red}. If you want the LLM to put out a warning and you just tell it to make it red, without pointing out that there is the .warning class, it will likely create a new CSS def for that element (or even inline it - the latest Claude Code has a tendency to do that). That's fine and will make management happy for now. But if later management decides that it wants all warning messages to be pink, it may be quite a challenge to catch every place without missing one. bunderbunder wrote 36 min ago: There really wouldn't be; it would just be spitting into the wind. What am I going to do, convince every member of my team to ignore a direct instruction from the people who sign our paychecks? hwillis wrote 3 hours 49 min ago: That's what claude.md etc are for. If you want it to follow your norms then you have to document them. tjansen wrote 1 hour 1 min ago: That's fine for norms, but I don't think you can use it to describe every single piece of your code. Every function, every type, every CSS class... ColonelPhantom wrote 1 hour 59 min ago: Well, sure, but from what I know, humans are way better at following 'implicit' instructions than LLMs. A human programmer can 'infer' most of the important basic rules from looking at the existing code, whereas all this agents.md/claude.md/whatever stuff seems necessary to even get basic performance in this regard. 
Also, the agents.md website seems to mostly list README.md-style 'how do I run this' instructions in its examples, not stylistic guidelines. Furthermore, it would be nice if these agents added it themselves. With a human, you tell them "this is wrong, do it that way" and they would remember it. (Although this functionality seems to be worked on?) rdsubhas wrote 7 hours 24 min ago: To be fair, this is a daily life story for any senior engineer working with other engineers. Leynos wrote 7 hours 25 min ago: I wonder if a large context model could be employed here via tool call. One of the great things Gemini chat can do is ingest a whole GitHub repo. Perhaps "before implementing a new utility or helper function, ask the not-invented-here tool if it's been done already in the codebase" Of course, now I have to check if someone has done this already. bunderbunder wrote 1 hour 29 min ago: Large context models don't do a great job of consistently attending to the entire context, so it might not work out as well in practice as continuing to improve the context engineering parts of coding agents would. I'd bet that most of the improvement in Copilot-style tools over the past year is coming from rapid progress in context engineering techniques, and the contribution of LLMs is more modest. LLMs' native ability to independently "reason" about a large slushpile of tokens just hasn't improved enough over that same time period to account for how much better the LLM coding tools have become. It's hard to see or confirm that, though, because the only direct comparison you can make is changing your LLM selection in the current version of the tool. Plugging GPT5 into the original version of Copilot from 2021 isn't an experiment most of us are able to try. knes wrote 2 hours 9 min ago: This is what we do at Augmentcode.com. We started by building the best code retrieval and built an agent around it. 4b11b4 wrote 3 hours 48 min ago: Sure, but just because it went into context doesn't mean the LLM "understands" it. Also, not all sections of context are equal. itsdavesanders wrote 4 hours 7 min ago: Claude can use tools to do that, and some different code indexer MCPs work, but that depends on the LLM doing the coding to make the right searches to find the code. If you are in a project where your helper functions or shared libs are scattered everywhere it's a lot harder. Just like with humans, it definitely works better if you follow good naming conventions and file patterns. And even then I tend to make sure to just include the important files in the context or clue the LLM in during the prompt. It also depends on what language you use. A LOT. During the day I use LLMs with dotnet and it's pretty rough compared to when I'm using rails on my side projects. Dotnet requires a lot more prompting and hand-holding, both due to its complexity but also due to how much more verbose it is. schiho wrote 9 hours 26 min ago: I just ran into this issue with Claude Sonnet 4.5: asked it to copy/paste some constants, a bigger chunk of code, from one file to another; it instead "extracted" pieces and named them so. As a last resort, after going back and forth, it agreed to do a file copy by running a system command. I was surprised that of all the programming tasks, a copy/paste felt challenging for the agent. tjansen wrote 9 hours 19 min ago: I guess the LLMs are trained to know what finished code looks like. They don't really know the operations a human would use to get there.
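The "not-invented-here tool" Leynos describes doesn't necessarily require a large-context model; a plain text search over definitions, exposed to the agent as a tool, already covers a lot of the helper-function problem tjansen raises. A minimal sketch in Python using only the standard library; the function name, signature, and regex are invented for illustration (a real setup might wrap ripgrep or a proper code indexer instead):

# find_existing_helpers.py - crude "has someone already written this?" lookup
# that an agent could call before adding a new utility.
import re
from pathlib import Path

def find_existing_helpers(repo_root, keywords, extensions=(".py", ".ts", ".js")):
    """Return (path, line_no, definition) for definitions whose name contains
    any of the given keywords."""
    definition = re.compile(
        r"^\s*(?:def|class|function|const|export\s+function)\s+(\w+)", re.MULTILINE
    )
    hits = []
    for path in Path(repo_root).rglob("*"):
        if path.suffix not in extensions or not path.is_file():
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for match in definition.finditer(text):
            name = match.group(1).lower()
            if any(keyword.lower() in name for keyword in keywords):
                line_no = text.count("\n", 0, match.start()) + 1
                hits.append((str(path), line_no, match.group(0).strip()))
    return hits

if __name__ == "__main__":
    # Hypothetical usage: what already exists that smells like "retry" or "slugify"?
    for hit in find_existing_helpers(".", ["retry", "slugify"]):
        print(hit)

Registered as a tool with an instruction like "call this before writing any new helper", it gives the agent the missing view of the repository without stuffing the whole codebase into context.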
hu3 wrote 9 hours 26 min ago: I have seen LLMs in VSCode Copilot ask to execute 'mv oldfile.py newfile.py'. So there's hope. But often they just delete and recreate the file, indeed. AllegedAlec wrote 9 hours 27 min ago: On a more important level, I found that they still do really badly at even a minorly complex task without extreme babysitting. I wanted it to refactor a parser in a small project (2.5K lines total) because it'd gotten a bit too interconnected. It made a plan, which looked reasonable, so I told it to do this in stages, with checkpoints. It said it'd done so. I asked it "so is the old architecture also removed?" "No, it has not been removed." "Is the new structure used in place of the old one?" "No, it has not." After it did so, 80% of the test suite failed because nothing it'd written was actually right. Did so three times with increasingly more babysitting, but it failed at the abstract task of "refactor this" no matter what with pretty much the same failure mode. I feel like I have to tell it exactly to make changes X and Y to class Z, remove class A etc etc, at which point I can't let it do stuff unsupervised, which is half of the reason for letting an LLM do this in the first place. jansan wrote 8 hours 20 min ago: I was hoping that LLMs being able to access strict tools, like Gemini using Python libraries, would finally give reliable results. So today I asked Gemini to simplify a mathematical expression with sympy. It did and explained to me how some part of the expression could be simplified wonderfully as a product of two factors. But it was all a lie. Even though I explicitly asked it to use sympy in order to avoid such hallucinations and get results that are actually correct, it used its own flawed reasoning on top and again gave me a completely wrong result. You still cannot trust LLMs. And that is a problem. ogogmad wrote 5 hours 2 min ago: The obvious point has to be made: Generating formal proofs might be a partial fix for this. By contrast, coding is too informal for this to be as effective for it. coldtea wrote 8 hours 23 min ago: >I feel like I have to tell it exactly to make changes X and Y to class Z, remove class A etc etc, at which point I can't let it do stuff unsupervised, which is half of the reason for letting an LLM do this in the first place. The reason better turn into "It can do stuff faster than I ever could if I give it step-by-step high-level instructions" instead. AllegedAlec wrote 7 hours 16 min ago: That would be a solution, yes. But currently it feels extremely borked from a UX perspective. It purports to be able to do this, but when you tell it to it breaks in unintuitive ways. I hate this idea of "well you just need to understand all the arcane ways in which to properly use it to its proper effects". It's like a car which has a gear shifter, but that's not fully functional yet, so instead you switch gear by spelling out in morse code the gear you want to go into using L as short and R as long. Furthermore, you shouldn't try to listen to 105-112 on the FM band on the radio, because those frequencies are used to control the brakes and ABS and if you listen to those frequencies the brakes no longer work. We would rightfully stone any engineer who'd design this and then say "well obvious user error" when the user rightfully complains that they crash whenever they listen to Arrow FM. coldtea wrote 6 hours 16 min ago: >But currently it feels extremely borked from a UX perspective.
It purports to be able to do this, but when you tell it to it breaks in unintuitive ways. Thankfully as programmers we know better and don't need to care what the UI pretends to be able to do :) >We would rightfully stone any engineer who'd design this and then say "well obvious user error" when the user rightfully complains that they crash whenever they listen to Arrow FM. We might curse the company and engineer who did it, but we would still use that car and do those workarounds, if doing so allowed us to get to our destination in 1/10 the regular time... AllegedAlec wrote 3 hours 5 min ago: > >But currently it feels extremely borked from a UX perspective. It purports to be able to do this, but when you tell it to it breaks in unintuitive ways. > Thankfully as programmers we know better and don't need to care what the UI pretends to be able to do :) But we do though. You can't just say "yeah they left all the foot guns in but we ought to know not to use them", especially not when the industry shills tell you those footguns are actually rocket boosters to get you to the fucking moon and back. jeswin wrote 8 hours 33 min ago: > I wanted it to refactor a parser in a small project This expression tree parser (typescript to sql query builder - [1]) has zero lines of hand-written code. It was made with Codex + Claude over two weeks (part-time on the side). Having worked on ORMs previously, it would have taken me 4x-10x the time to get to the same state (which also has 100s of tests, with some repetitions). That's a massive saving in time. I did not have to babysit the LLMs at all. So the answer is, I think it depends on what you use it for, and how you use it. Like every tool, it takes a really long time to find a process that works for you. In my conversations with other developers who use LLMs extensively, they all have their unique, custom workflows. All of them however do focus on test suites, documentation, and method review processes. URI [1]: https://tinqerjs.org/ AllegedAlec wrote 7 hours 55 min ago: I have tried several. Overall I've now settled on strict TDD (which it still seems to not do unless I explicitly tell it to, even though I have it as a hard requirement in claude.md). jeswin wrote 7 hours 21 min ago: Claude forgets claude.md after a while, so you need to keep reminding it. I find that codex does a better design job than Claude at the moment, but it's 3x slower, which I don't mind. iLoveOncall wrote 8 hours 21 min ago: Hum yeah, it shows. Just the fact that the API looks completely different for Postgre and SQLite tells us everything we need to know about the quality of the project here. pprotas wrote 7 hours 24 min ago: I guess the interesting question is whether @jeswin could have created this project at all if AI tools were not involved. And if yes, would the quality even be better? iLoveOncall wrote 3 hours 32 min ago: Actually the interesting question is whether this library not existing would have been a loss for humanity. I'll posit that it would not. jeswin wrote 7 hours 18 min ago: Very true. However, to claim that the "API looks completely different for Postgre and SQLite" is disingenuous. What was he looking at? tom_ wrote 2 hours 13 min ago: There are two examples on the landing page, and they both look quite different. Surely if the API is the same for both, there'd be just one example that covers both cases, or two examples would be deliberately made as identical as possible?
(Like, just a different new somewhere, or a different import directive at the top, and everything else exactly the same?) I think that's the point. Perhaps experienced users of the relevant technologies will just be able to automatically figure this stuff out, but this is a general discussion - people not terribly familiar with any of them, but curious about what a big pile of AI code might actually look like, could get the wrong impression. jeswin wrote 5 min ago: If you're mentioning the first two examples, they're doing different things. The pg example does an orderby, and the sqlite example does a join. You'll be able to switch the client (i.e., better-sqlite and pg-promise) in either statement, and the same query would work on the other database. Maybe I should use the same example repeated for clarity. Let me do that. jeswin wrote 7 hours 50 min ago: > Just the fact that the API looks completely different for Postgre and SQLite tells us everything we need to know about the quality of the project here. How does the API look completely different for pg and sqlite? Can you share an example? It's an implementation of LINQ's IQueryable. With some bells missing in DotNet's Queryable, like Window functions (RANK queries etc) which I find quite useful. Add: What you've mentioned is largely incorrect. But in any case, it is a query builder. Meaning, an ORM-like database abstraction is not the goal. This allows us to support pg's extensions, which aren't applicable to other databases. habibur wrote 9 hours 0 min ago: Might be related to what the article was talking about. AI can't cut-paste. It deletes the code and then regenerates it at another location instead of cut-pasting. Obviously the regenerated code drifts a little from the deleted code. hu3 wrote 9 hours 24 min ago: Interesting. What model and tool was used? I have seen similar failure modes in Cursor and VSCode Copilot (using gpt5) where I have to babysit relatively small refactors. AllegedAlec wrote 9 hours 21 min ago: Claude code. Whichever model it started up automatically last weekend, I didn't explicitly check. rglynn wrote 9 hours 17 min ago: This feels like a classic Sonnet issue. From my experience, Opus or GPT-5-high are less likely to do the "narrow instruction following without making sensible wider decisions based on context" thing than Sonnet. coldtea wrote 8 hours 22 min ago: This is "just use another Linux distro" all over again rglynn wrote 4 hours 4 min ago: Yes and no, it's a fair criticism to some extent. Inasmuch as I would agree that different models of the same type have superficial differences. However, I also think that models which focus on higher reasoning effort in general are better at taking into account the wider context and not missing obvious implications from instructions. Non-reasoning or low-reasoning models serve a purpose, but to suggest they are akin to different flavours misses what is actually quite an important distinction. koliber wrote 9 hours 31 min ago: Most developers are also bad at asking questions. They tend to assume too many things from the start. In my 25 years of software development I could apply the second critique to over half of the developers I knew. That includes myself for about half of that career. rkomorn wrote 9 hours 28 min ago: But, just like lots of people expect/want self-driving to outperform humans even on edge cases in order to trust them, they also want "AI" to outperform humans in order to trust it. So: "humans are bad at this too" doesn't have much weight (for people with that mindset).
It makes sense to me, at least. darkwater wrote 9 hours 22 min ago: If we had a knife that most of the time cuts a slice of bread like the bottom p50 of humans cutting a slice of bread with their hands, we wouldn't call the knife useful. Ok, this example is probably too extreme; replace the knife with an industrial machine that cuts bread vs a human with a knife. Nobody would buy that machine either if it worked like that. koliber wrote 8 hours 46 min ago: Agreed in a general sense, but there's a bit more nuance. If a knife slices bread like a normal human at p50, it's not a very good knife. If a knife slices bread like a professional chef at p50, it's probably a very decent knife. I don't know if LLMs are better at asking questions than a p50 developer. In my original comment I wanted to raise the question of whether the fact that LLMs are not good at asking questions makes them still worse than human devs. The first LLM critique in the original article is that they can't copy and paste. I can't argue with that. My 12-year-old copies and pastes better than top coding agents. The second critique says they can't ask questions. Since many developers also are not good at this, how does the current state-of-the-art LLM compare to a p50 developer in this regard? Certhas wrote 8 hours 50 min ago: I think this is still too extreme. A machine that cuts and preps food at the same level as a 25th percentile person _being paid to do so_, while also being significantly cheaper would presumably be highly relevant. rkomorn wrote 8 hours 40 min ago: Aw man. There are so many angles though. Your p25 employee is probably much closer to your p95 employee than to the p50 "standard" human, so yeah, I think you have a point there. But at least in food prep, p25 would already be pretty damn hard to achieve. That's a hell of a lot of autonomy and accuracy (at least in my restaurant kitchen experience which is admittedly just one year in "fine dining"-ish kitchens). I'd say the p25 of software or SRE folks I've worked with is also a pretty high bar to hit, too, but maybe I've been lucky. rkomorn wrote 9 hours 15 min ago: I feel kind of attacked for my sub-p50 bread slicing skills, TBH. :( rconti wrote 9 hours 31 min ago: Doing hard things that aren't greenfield? Basically any difficult and slightly obscure question I get stuck with and hope the collective wisdom of the internet can solve? athrowaway3z wrote 8 hours 56 min ago: You don't learn new languages/paradigms/frameworks by inserting them into an existing project. LLMs are especially tricky because they do appear to work magic on a small greenfield, and the majority of people are doing clown-engineering. But I think some people are underestimating what can be done in larger projects if you do everything right (e.g. docs, tests, comments, tools) and take time to plan. nikanj wrote 9 hours 31 min ago: 4/5 times when Claude is looking for a file, it starts by running bash(dir c:\test /b). First it gets an error because bash doesn't understand \. Then it gets an error because /b doesn't work. And as LLMs don't learn from their mistakes, it always spends at least half a dozen tries (e.g. bash(cmd.exe /c dir c:\test /b)) before it figures out how to list files. If it was an actual coworker, we'd send it off to HR cheema33 wrote 8 hours 52 min ago: Most models struggle in a Windows environment. They are trained on a lot of Unixy commands and not as much on Windows and PowerShell commands.
It was frustrating enough that I started using WSL for development when using Windows. That helped me significantly. I am guessing this is because: 1. Most of the training material online references Unix commands. 2. Most Windows devs are used to GUIs for development using Visual Studio etc. GUIs are not as easy to train on. Side note: an interesting thing I have noticed in my own org is that devs with a Windows background strictly use GUIs for git. The rest are comfortable with using git from the command line. anonzzzies wrote 9 hours 21 min ago: I have a list of those things in CLAUDE.md -> it seems to help (unless its context is full, but you should never let it get close really). ra wrote 9 hours 32 min ago: IaC, and DSLs in general. IanCal wrote 9 hours 43 min ago: Editing tools are easy to add; it's just that you have to pick what things to give them, because with too many they struggle, as it uses up a lot of context. Still, as costs come down, multiple steps to look for tools become cheaper too. I'd like to see what happens with better refactoring tools; I'd make a bunch more mistakes copying and retyping or using awk. If they want to rename something they should be able to use the same tooling the rest of us get. Asking questions is a good point, but that's both a bit of prompting and I think the move to having more parallel work makes it less relevant. One of the reasons clarifying things more upfront is useful is that we take a lot of time and cost a lot of money to build things, so the economics favour getting it right the first time. As the time comes down and the cost drops to near zero, the balance changes. There are also other approaches: clarify more what you want and how to do it first, break that down into tasks, then let it run with those (spec kit). This is an interesting area. baq wrote 9 hours 46 min ago: they're getting better at asking questions; I routinely see search calls against the code base index. They just don't ask me questions. davydm wrote 11 hours 23 min ago: Coding and...? Black616Angel wrote 9 hours 48 min ago: Copy and pasting. Oh, sorry. You already said that. :D drdeca wrote 9 hours 50 min ago: More granular: what things is it bad at that result in it being overall "bad at coding"? It isn't all of the parts.
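On IanCal's point that editing tools are easy to add: a literal cut-and-paste (or "move lines") primitive is only a few dozen lines, and it sidesteps the retype-from-memory failure mode the article describes. A rough sketch of what such a tool could look like if wired into an agent's tool list; the name, interface, and example file names are invented for illustration, not taken from any existing agent:

# move_lines.py - a cut-and-paste primitive an agent could call instead of
# deleting code and retyping it from memory.
from pathlib import Path

def move_lines(src_path, start, end, dst_path, insert_before):
    """Move lines start..end (1-based, inclusive) from src_path into dst_path,
    inserting them before line insert_before of the destination file."""
    src = Path(src_path).read_text(encoding="utf-8").splitlines(keepends=True)
    dst = Path(dst_path).read_text(encoding="utf-8").splitlines(keepends=True)

    block = src[start - 1:end]                 # the exact bytes being moved
    remaining = src[:start - 1] + src[end:]
    updated = dst[:insert_before - 1] + block + dst[insert_before - 1:]

    Path(src_path).write_text("".join(remaining), encoding="utf-8")
    Path(dst_path).write_text("".join(updated), encoding="utf-8")
    return {"moved_lines": len(block), "from": src_path, "to": dst_path}

if __name__ == "__main__":
    # Hypothetical example: move lines 10-40 of old_page.html into header.html,
    # inserting them before its line 5.
    print(move_lines("old_page.html", 10, 40, "header.html", 5))

Because the moved block is sliced rather than regenerated, nothing in it can drift - which is exactly the property the copy/paste discussion above is asking for.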